Clean updating ondemand #2

NicolasJiaxin · 2021-08-05T16:59:09Z

Duplicate of #1.
Clean version of update.

NicolasJiaxin · 2021-08-05T20:40:48Z

@lemire The main problem I am having is that the get(T out) method does not support simdjson::ondemand::value as the type T. I could not find a clean way around this; maybe you have a solution?

lemire · 2021-08-05T21:58:37Z

Getting back to you tomorrow!

lemire · 2021-08-06T14:58:15Z

@NicolasJiaxin Let me try to understand the issue.

lemire · 2021-08-06T15:01:26Z

BTW your struggles are exactly the point. :-) I want you to find all the confusing parts so we can discuss and see if we can improve our API later. Or, at least, the documentation.

lemire · 2021-08-06T15:04:00Z

The main problem that I am having is that the get(T out) method does not support T type as simdjson::ondemand::value. I could not find any nice way around, maybe you have a solution?

Here is your own code in simdjson (tests):

    bool run_success_test(const padded_string &json, std::string_view json_pointer, std::string expected) {
        TEST_START();
        ondemand::parser parser;
        ondemand::document doc;
        ondemand::value val;
        std::string_view actual;
        ASSERT_SUCCESS(parser.iterate(json).get(doc));
        ASSERT_SUCCESS(doc.at_pointer(json_pointer).get(val));
        ASSERT_SUCCESS(simdjson::to_json_string(val).get(actual));
        ASSERT_EQUAL(actual, expected);
    ...

As you can see, you do doc.at_pointer(json_pointer).get(val) where val is an instance of value.
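For readers following along, here is a minimal sketch of the calling convention discussed above: a result object whose get(out) writes the value only on success and returns an error code, with its members kept private so that nothing like .first or .second is reachable. It is a toy stand-in, not simdjson's implementation; toy_result, toy_error and lookup are hypothetical names introduced only for this illustration.

```cpp
#include <string>
#include <utility>

// Hypothetical error enum, standing in for a library error code.
enum class toy_error { SUCCESS = 0, NO_SUCH_FIELD };

// Toy result wrapper (assumption: T is default-constructible).
// The value and the error are private, so callers must go through get().
template <typename T>
class toy_result {
public:
    toy_result(T value) : value_(std::move(value)), error_(toy_error::SUCCESS) {}
    toy_result(toy_error error) : value_(), error_(error) {}

    // Writes into `out` only on success; always returns the error code.
    toy_error get(T &out) {
        if (error_ == toy_error::SUCCESS) { out = std::move(value_); }
        return error_;
    }

private:
    T value_;
    toy_error error_;
};

// Hypothetical lookup that succeeds only for the key "name".
inline toy_result<std::string> lookup(const std::string &key) {
    if (key == "name") { return std::string("simdjson"); }
    return toy_error::NO_SUCH_FIELD;
}
```

With this shape, a caller declares the output variable first and then checks the returned code, which is the same flow as doc.at_pointer(json_pointer).get(val) in the test quoted above.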

lemire · 2021-08-06T15:06:51Z

inst/include/RcppSimdJson/deserialize.hpp

-        if(parsed.at_pointer(std::string_view(query)).get(queried) == simdjson::SUCCESS) {
-            return deserialize(queried, parse_opts);				// #nocov
+        auto queried = parsed.at_pointer(std::string_view(query));
+        if(queried.second == simdjson::SUCCESS) {


It troubles me that you are able to do queried.second. I thought we had made this impossible.

The syntax is supposed to be...

    ondemand::value queried;
    simdjson::error_code error = parsed.at_pointer(std::string_view(query)).get(queried);
    if(queried ==...}

Does that not work?

NicolasJiaxin · 2021-08-06

That is what I had before, but if you go back to yesterday's commit e91ee40, I still get an error from the get() method saying that it is not implemented for the given type (simdjson::ondemand::value). Also, the documentation says that it should not be supported, so I was surprised to see it work in our own tests.

lemire · 2021-08-06

I'll investigate right now. Meanwhile, please don't use first/second. We don't want end users to use them. They are confusing and error-prone.

NicolasJiaxin · 2021-08-06

Do you want me to revert to commit e91ee40 right now?

lemire · 2021-08-06

I just finished producing a PR directly on simdjson so that it will no longer be possible to use first/second. I do not yet have a good idea of the issue you are encountering. Let me have a look first.

lemire · 2021-08-06

I'll need more coffee too!

NicolasJiaxin · 2021-08-06

OK, I will revert then, so that this is not broken code.

lemire · 2021-08-06T15:07:40Z

You should not be able to use .first and .second. Let me investigate.

lemire · 2021-08-06T15:34:38Z

Somehow one is able to access first and second. I am investigating. It should not be possible.

NicolasJiaxin · 2021-08-06T15:51:28Z

Once that access is removed, we can revert to commit e91ee40, where I used get instead of second. However, it still has the error that I mentioned.

NicolasJiaxin · 2021-08-20T15:33:11Z

You probably know this, but there is a complete log of the tests and the failed tests in RcppSimdJson.Rcheck/tests/tinytest.Rout.fail when you run the tests locally.

lemire · 2021-08-20T16:27:51Z

I realize this but thanks for the reminder. I have not managed to get to it today, more later.

lemire · 2021-08-21T14:33:43Z

@NicolasJiaxin Let us try to improve support for integers. Could you have a look at my proposal at simdjson/simdjson#1703?

NicolasJiaxin · 2021-08-28T16:56:21Z

@lemire It works! As expected, all the failing tests (except one) were related to the numbers issue. The only test that was still failing was a test with valid_json. I changed it because I think this is another case where On Demand does not know (yet) that the JSON is invalid, but I think you should check it to be sure it is OK to change it.

NicolasJiaxin · 2021-08-28T16:59:09Z

inst/tinytest/test_simdjson_utils.R

-expect_false(any(is_valid_json(valid_utf8)))
+# Change to expect_true since valid json is only detected when parsed/accessed.
+#expect_false(any(is_valid_json(valid_utf8)))
+expect_true(any(is_valid_json(valid_utf8)))
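To illustrate why a lazy reader can report success on input that is malformed further along, here is a toy sketch. It is not simdjson code and does not model how On Demand validates documents; lazy_ints is a hypothetical class that reads a comma-separated list of small non-negative integers one element at a time and notices a bad element only when the cursor reaches it.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Toy lazy reader (hypothetical, for illustration only).
// Input looks like "1,2,3"; values are assumed small enough to fit in int.
class lazy_ints {
public:
    explicit lazy_ints(std::string_view text) : text_(text) {}

    // Reads only the next element. Returns std::nullopt at the end of the
    // input or when the element under the cursor is malformed.
    std::optional<int> next() {
        if (failed_ || pos_ >= text_.size()) { return std::nullopt; }
        const std::size_t start = pos_;
        int value = 0;
        while (pos_ < text_.size() && text_[pos_] >= '0' && text_[pos_] <= '9') {
            value = value * 10 + (text_[pos_] - '0');
            ++pos_;
        }
        const bool no_digits = (pos_ == start);
        const bool bad_separator = (pos_ < text_.size() && text_[pos_] != ',');
        if (no_digits || bad_separator) {
            failed_ = true;  // the error is recorded only now, on access
            return std::nullopt;
        }
        if (pos_ < text_.size()) { ++pos_; }  // skip the comma
        return value;
    }

    // True once a malformed element has actually been reached.
    bool failed() const { return failed_; }

private:
    std::string_view text_;
    std::size_t pos_ = 0;
    bool failed_ = false;
};
```

For the input "1,2,x", the first two calls to next() succeed and failed() stays false; only the third call reaches the bad element. A validity check that never walks the whole input would therefore see no error, which is the kind of behaviour suspected in the test change above.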


NicolasJiaxin · 2021-08-28T17:00:01Z

inst/tinytest/test_fparse_fload.R

+if (FALSE) {
 .write_file("JUNK JSON", test_file1)
 .write_file('"VALID JSON"', test_file2)



Also, this one I removed.

lemire · 2021-08-29T00:06:12Z

@NicolasJiaxin That's fantastic. I think that's all we needed to do for the summer.

Would you do a PR from your repo to the eddelbuettel/rcppsimdjson repo, while explaining your work? Label it clearly as a prototype and explain a bit what you did. This way people will be able to build on your work (if they choose to do so).

NicolasJiaxin · 2021-08-30T21:44:55Z

A few remarks regarding On Demand after doing this work:

It would be nice to have something like object.count_elements() like for arrays (I am working on it).
Currently, we have is_integer() and is_negative() to identify the type of a number inside an ondemand::value without parsing it. However, unless I am mistaken, we cannot tell apart int64 and uint64. If possible, something like is_large_integer() would help.

NicolasJiaxin · 2021-08-30T21:47:04Z

I will close this PR as I have opened one to the main repo (eddelbuettel/rcppsimdjson#75).

lemire · 2021-08-30T22:16:50Z

It would be nice to have something like object.count_elements() like for arrays (I am working on it).

I have opened an issue.

Currently, we have is_integer() and is_negative() to identify the type of a number inside an ondemand::value without parsing it. However, unless I am mistaken, we cannot tell apart int64 and uint64. If possible, something like is_large_integer() would help.

Can you elaborate on how this might be used? I am concerned about double and triple parsing. It would be a terrible pattern to do is_integer() then is_large_integer(), then to_uint64(). This would literally mean scanning the string number three times.

Right now, you can check if the number is negative. This is fast. If it is positive, then you can do is_integer() and then do to_uint64(). I do not want anyone to ever do this... people should just do get_number()... but let us say they do, then it will work. Then you can look at the size of whatever was returned by to_uint64() and decide whether it is large.

I am not dismissing your proposal. I just want to understand it.

That is, we just don't want to throw new functions into the API. We want to keep the API as tight as possible. Adding more functions makes it harder to use. Now, if we had more functions that can be used in a counterproductive manner, we risk making things worse.

(I am not dismissing your proposal. To be sure.)

NicolasJiaxin · 2021-08-31T21:28:48Z

I thought that is_large_integer would be called only once, instead of is_integer. It would validate both that the number is an integer and that it is large. This would be useful in the case of this wrapper, at least when scanning a number for the first time. As I have mentioned, the design of the wrapper is to scan through the document once to see which types are present, and then scan a second time to put those elements in an appropriate structure. The first time we scan, it is important to distinguish between int64, uint64 and double. However, with is_integer and is_negative we cannot distinguish between int64 and uint64, so I resorted to using ondemand::number. This meant I parsed the number without using it, just to determine its type. I figured that it would probably be better/faster to scan once with something like is_large_integer without computing the actual value.
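As a rough illustration of the three-way distinction described above (int64, uint64, double), here is a standalone sketch that classifies a number token in a single pass over its characters. It does not use simdjson and is not how simdjson classifies numbers; classify_number and number_kind are hypothetical names, the token is assumed to be a syntactically valid JSON number, and treating integers that do not fit in 64 bits as floating point is a choice made only for this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string_view>

enum class number_kind { signed_integer, unsigned_integer, floating_point };

// Hypothetical helper (not part of simdjson). Assumes `token` is already a
// valid JSON number such as "-12", "18446744073709551615" or "1.5e3".
inline number_kind classify_number(std::string_view token) {
    constexpr std::uint64_t max_u64 = std::numeric_limits<std::uint64_t>::max();
    constexpr std::uint64_t max_i64 =
        static_cast<std::uint64_t>(std::numeric_limits<std::int64_t>::max());

    const bool negative = !token.empty() && token.front() == '-';
    std::uint64_t magnitude = 0;
    bool overflow = false;

    for (std::size_t i = negative ? 1 : 0; i < token.size(); ++i) {
        const char c = token[i];
        // Any non-digit ('.', 'e', 'E') means the token is not an integer.
        if (c < '0' || c > '9') { return number_kind::floating_point; }
        const std::uint64_t digit = static_cast<std::uint64_t>(c - '0');
        if (overflow || magnitude > (max_u64 - digit) / 10) {
            overflow = true;  // keep scanning in case a '.' or exponent follows
        } else {
            magnitude = magnitude * 10 + digit;
        }
    }

    // Sketch-only choice: integers beyond 64 bits are reported as floating point.
    if (overflow) { return number_kind::floating_point; }
    if (negative) {
        // int64 reaches magnitude 2^63 on the negative side.
        return magnitude <= max_i64 + 1 ? number_kind::signed_integer
                                        : number_kind::floating_point;
    }
    return magnitude <= max_i64 ? number_kind::signed_integer
                                : number_kind::unsigned_integer;
}
```

Note that this sketch has to accumulate the magnitude anyway in order to separate int64 from uint64, which suggests that classifying an integer costs nearly as much as parsing it.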

lemire · 2021-09-01T02:19:15Z

@NicolasJiaxin So let us say I have a number string ... 'xxxxxxxxx'. Now, I do not want you to ever scan the number twice. It seems to me that what you would do is something like this:

check if it is is_large_integer, if not check if it is is_integer, if not conclude it is a float
then rescan a second time, calling get_uint64 or get_int64 or get_double

That's very bad because you could call is_large_integer, then is_integer, then get_double, thus scanning the input string three times.

Even just scanning the input string twice is really bad. I'd never want anyone to do it.

so I resorted to use ondemand::number.

Yeah. Parsing the numbers twice is not good.

NicolasJiaxin · 2021-09-01T19:49:53Z

@lemire Ahh... yes, I see what your concern was now. You are right, there is probably no point in having is_large_integer. It looks to me that for this design, the best approach is to scan numbers twice using ondemand::number. I can't seem to find a way around this...

lemire · 2021-09-01T23:03:37Z

Let me see if I can do a patch that solves this, somewhat.

NicolasJiaxin added 16 commits July 28, 2021 15:56

Update simdjson.cpp and simdjson.h.

ee69a36

Simplify.

bb6dd45

Scalar and typos.

27e6277

Type Doctor.

04c6c85

Vector.

e9238ef

Matrix.

1b1a969

Dataframe.

bd393f0

Deserialize.

f5a82d8

Typos.

ef42e59

Typos in vector.

c91288c

Typos and small fixes.

8965f95

Small attempt fix.

c874dd9

Attempt to fix qualifiers.

cbcd3a2

Other attempt.

eca9768

Attempt.

ae76beb

Temp fix.

563c00f

NicolasJiaxin added 3 commits August 5, 2021 17:09

Typo.

9a0f1ba

Only get issue left.

8cc1f83

Typo

e91ee40

lemire reviewed Aug 6, 2021

NicolasJiaxin force-pushed the clean-updating-ondemand branch 2 times, most recently from 176c1be to e91ee40 on August 6, 2021 17:18

Remove test failures.

f32b9a1

NicolasJiaxin added 12 commits August 26, 2021 17:03

Update simdjson files (numbers).

a4d0db9

Update new names.

6814aac

Update Type_Doctor.

e29dab4

Fix matrix. Add complete_json_type enum.

8a30053

Update dataframes deserialization.

68cf9fc

Update simdjson files.

1713a3a

Update numeric scalar documents.

1d8bd66

Fix integer scalar docs.

707c311

Fix utf8 test.

f2540f7

Forgot to save change before committing.

6a51dff

Fix test in examples fparse.

6d78d86

Fix again.

55fc1ae

NicolasJiaxin commented Aug 28, 2021

NicolasJiaxin added 2 commits August 30, 2021 14:40

Fix vector with int64.

2ffaef0

Minor simplification with int64.

e27303c

NicolasJiaxin closed this Aug 30, 2021

NicolasJiaxin mentioned this pull request Aug 30, 2021

[PROTOTYPE] Updating to simdjson On Demand (~version 1.0) eddelbuettel/rcppsimdjson#75

Closed
