Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean updating ondemand #2

Closed

Conversation

NicolasJiaxin
Copy link
Collaborator

Duplicate of #1.
Clean version of update.

@NicolasJiaxin
Copy link
Collaborator Author

@lemire The main problem that I am having is that the get(T out) method does not support T type as simdjson::ondemand::value. I could not find any nice way around, maybe you have a solution?

@lemire
Copy link
Owner

lemire commented Aug 5, 2021

Getting back to you tomorrow!

@lemire
Copy link
Owner

lemire commented Aug 6, 2021

@NicolasJiaxin Let me try to understand the issue.

@lemire
Copy link
Owner

lemire commented Aug 6, 2021

BTW your struggles are exactly the point. :-) I want you to find all the confusing parts so we can discuss and see if we can improve our API later. Or, at least, the documentation.

@lemire
Copy link
Owner

lemire commented Aug 6, 2021

I am having is that the get(T out) method does not support T type as simdjson::ondemand::value. I could not find any nice way around, maybe you have a solution?

Here is your own code in simdjson (tests):

   bool run_success_test(const padded_string & json,std::string_view json_pointer,std::string expected) {
        TEST_START();
        ondemand::parser parser;
        ondemand::document doc;
        ondemand::value val;
        std::string_view actual;
        ASSERT_SUCCESS(parser.iterate(json).get(doc));
        ASSERT_SUCCESS(doc.at_pointer(json_pointer).get(val));
        ASSERT_SUCCESS(simdjson::to_json_string(val).get(actual));
        ASSERT_EQUAL(actual,expected);
...

As you can see, you do doc.at_pointer(json_pointer).get(val) where val is an instance of value.

if(parsed.at_pointer(std::string_view(query)).get(queried) == simdjson::SUCCESS) {
return deserialize(queried, parse_opts); // #nocov
auto queried = parsed.at_pointer(std::string_view(query));
if(queried.second == simdjson::SUCCESS) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This troubles me that you are able to do queried.second. I thought we made this impossible..

The syntax is supposed to be...

ondemand::value queried;
simdjson::error_code error = parsed.at_pointer(std::string_view(query)).get(queried);
if(queried ==...}

Does that no work?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is what I had in the before, but if you go back to yesterday's commit e91ee40, I still have an error with regard to the get() method which says that it is not implemented with the given type (simdjson::ondemand::value). And also, when I look in the documentation, it says that is should not be supported, so I was surprised to see it work in our own tests.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll investigate right now. Meanwhile please don't use first/second. We don't want end users to use it. It is confusing and error prone.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want me to revert to commit e91ee40 right now?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just finished producing a PR directly on simdjson so that it will no longer possible to use first/second. I don't have yet a good idea of the issue you are encountering. Let have a look first.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll need more coffee too!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I will revert then, so this is not broken code.

@lemire
Copy link
Owner

lemire commented Aug 6, 2021

You should not be able to use .first and .second. Let me investigate.

@lemire
Copy link
Owner

lemire commented Aug 6, 2021

Somehow one is able to access first and second. I am investigating. It should not be possible.

@NicolasJiaxin
Copy link
Collaborator Author

NicolasJiaxin commented Aug 6, 2021

We can revert to commit e91ee40 once that permission is removed where I used get instead of second. However, it still have the error that I have mentioned.

@NicolasJiaxin NicolasJiaxin force-pushed the clean-updating-ondemand branch 2 times, most recently from 176c1be to e91ee40 Compare August 6, 2021 17:18
@NicolasJiaxin
Copy link
Collaborator Author

You probably know this, but there is a complete log of the tests and the failed tests in RcppSimdJson.Rcheck/tests/tinytest.Rout.fail when you run the tests locally.

@lemire
Copy link
Owner

lemire commented Aug 20, 2021

I realize this but thanks for the reminder. I have not managed to get to it today, more later.

@lemire
Copy link
Owner

lemire commented Aug 21, 2021

@NicolasJiaxin Let us try to improve support for integers. Could you have a look at my proposal at simdjson/simdjson#1703

?

@NicolasJiaxin
Copy link
Collaborator Author

@lemire It works! As expected, all failed tests (except one) were related to numbers issue. The only test that was still failing was a test with valid_json, but I changed it because I think this is another instance of On Demand that does not know that the JSON is invalid (yet), but I think you should check it out to be sure it is ok to change it.

Comment on lines -171 to +173
expect_false(any(is_valid_json(valid_utf8)))
# Change to expect_true since valid json is only detected when parsed/accessed.
#expect_false(any(is_valid_json(valid_utf8)))
expect_true(any(is_valid_json(valid_utf8)))
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one.

Comment on lines +513 to 516
if (FALSE) {
.write_file("JUNK JSON", test_file1)
.write_file('"VALID JSON"', test_file2)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, this one I removed.

@lemire
Copy link
Owner

lemire commented Aug 29, 2021

@NicolasJiaxin That's fantastic. I think that's all we needed to do for the summer.

Would you do a PR from your repo to the eddelbuettel/rcppsimdjson repo, while explaining your work? Label it clearly as a prototype and explain a bit what you did. This way people will be able to build on your work (if they choose to do so).

@NicolasJiaxin
Copy link
Collaborator Author

A few remarks regarding On Demand after doing this work:

  • It would be nice to have something like object.count_elements() like for arrays (I am working on it).
  • Currently, we have is_integer() and is_negative() to identify the type of a number inside a ondemand::value without parsing it. However, unless I am mistaken, we cannot tell apart int64 and uint64. If possible, something like is_large_integer() would help.

@NicolasJiaxin
Copy link
Collaborator Author

I will close this PR as I have opened one to the main repo here.

@lemire
Copy link
Owner

lemire commented Aug 30, 2021

It would be nice to have something like object.count_elements() like for arrays (I am working on it).

I have opened an issue.

Currently, we have is_integer() and is_negative() to identify the type of a number inside a ondemand::value without parsing it. However, unless I am mistaken, we cannot tell apart int64 and uint64. If possible, something like is_large_integer() would help.

Can you elaborate on how this might be used? I am concerned about double and triple parsing. It would be a terrible pattern to do is_integer() then is_large_integer(), then to_uint64(). This would literally mean scanning the string number three times.

Right now, you can check if the number is negative. This is fast. If it is positive, then you can do is_integer() and then do to_uint64(). I do not want anyone to ever do this... people should just do get_number()... but let us say they do, then it will work. Then you can look at the size of whatever was returned by to_uint64() and decide whether it is large.

I am not dismissing your proposal. I just want do understand it.

That is, we just don't want to throw new functions into the API. We want to keep the API as tight as possible. Adding more functions makes it harder to use. Now, if we had more functions that can be used in a counterproductive manner, we risk making things worse.

(I am not dismissing your proposal. To be sure.)

@NicolasJiaxin
Copy link
Collaborator Author

I thought that is_large_integer would be called only once instead of is_integer. It would validate that it is an integer and that it is large. This would be useful in the case of this wrapper at least when scanning a number for the first time. As I have mentioned, the design of the wrapper is to scan through the document once to see what are the types present, and then scan a second time to put those elements in an appropriate structure. The first time we scan, it is important to distinguish between int64, uint64 and double. However, with is_integer and is_negative we cannot distinguish between int64 and uint64, so I resorted to use ondemand::number. This meant I parsed the number without using it just to distinguish its type. I figured that it would probably be better/faster to scan once with something like is_large_integer without computing the actual value.

@lemire
Copy link
Owner

lemire commented Sep 1, 2021

@NicolasJiaxin So let us say I have a number string ... 'xxxxxxxxx'. Now, I do not want you to ever scan the number twice. It seems to me that what you would do is something

  • check if it is is_large_integer, if not check if it is is_integer, if not conclude it is a float
  • then rescan a second time, calling get_uint64 or get_int64 or get_double

That's very bad because you could call is_large_integer, then is_integer, then get_double, thus scanning the input string three times.

Even just scanning the input string twice is really bad. I'd never want anyone to do it.

so I resorted to use ondemand::number.

Yeah. Parsing the numbers twice is not good.

@NicolasJiaxin
Copy link
Collaborator Author

@lemire Ahh... yes, I see what was your concern now. You are right, there is probably no useful point of having is_large_integer. It looks to me that for this design, the best approach is to scan numbers twice using ondemand::number. I can't seem to find a way around this...

@lemire
Copy link
Owner

lemire commented Sep 1, 2021

Let me see if I can do a patch that solves this, somewhat.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants