-
Notifications
You must be signed in to change notification settings - Fork 7.9k
Bundle ext/simdjson into core #6551
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I think it'd be more practical to provide a download link/script for jsonexamples if this RFC passes - there's 15MB of json in there, since tests are typically distributed with releases (or downloaded from git)
(Similar to how gen_stubs.php downloads php-parser and validates the sha256 sum)
|
||
for (simdjson::dom::key_value_pair field : simdjson::dom::object(element)) { | ||
zval value = create_object(field.value); | ||
add_property_zval_ex(&obj, field.key.data(), strlen(field.key.data()), &value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This won't work for embedded null bytes (\u0000)? Use .length() instead of strlen?
break; | ||
case simdjson::dom::element_type::ARRAY : | ||
zval arr; | ||
array_init(&arr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might be able to speed the ::ARRAY
case up a bit more by imitating what array_values does.
Something along these lines - there's a need to check if the size is 0xFFFFFF in addition to that, though
/**
* Get the size of the array (number of immediate children).
* It is a saturated value with a maximum of 0xFFFFFF: if the value
* is 0xFFFFFF then the size is 0xFFFFFF or greater.
*/
inline size_t size() const noexcept;
/* Initialize return array */
array_init_size(arr, arrlen);
zend_hash_real_init_packed(Z_ARRVAL_P(arr));
/* Go through input array and add values to the return array */
ZEND_HASH_FILL_PACKED(Z_ARRVAL_P(arr)) {
for (simdjson::dom::element child : simdjson::dom::array(element)) {
zval value = create_array(child);
ZEND_HASH_FILL_ADD(&value);
}
} ZEND_HASH_FILL_END();
break; | ||
case simdjson::dom::element_type::INT64 : ZVAL_LONG(&v, int64_t(element)); | ||
break; | ||
case simdjson::dom::element_type::UINT64 : ZVAL_LONG(&v, uint64_t(element)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this to work as expected, you'd
- need to add extra code for 32-bit builds of PHP and convert out of range values to doubles (
#if PHP_INT_SIZE == 4
) - Need to check if the value
> ZEND_LONG_MAX
and convert to double for out of range UINT64/INT64
zval v; | ||
switch (element.type()) { | ||
//ASCII sort | ||
case simdjson::dom::element_type::ARRAY : ZVAL_LONG(&v, uint64_t(simdjson::dom::array(element).size())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the value is 0xFFFFFF then the size is 0xFFFFFF or greater.
both for class array and for class object.
Can this add some tests of arrays and objects of size 16777216?
Considering checking that the first character of an stdClass object property (not array keys) isn't "\0" if (and only if) length > 0, and adding a test of the expected behavior // From ext/json/json_parser.y
static int php_json_parser_object_update(php_json_parser *parser, zval *object, zend_string *key, zval *zvalue)
{
/* if JSON_OBJECT_AS_ARRAY is set */
if (Z_TYPE_P(object) == IS_ARRAY) {
zend_symtable_update(Z_ARRVAL_P(object), key, zvalue);
} else {
if (ZSTR_LEN(key) > 0 && ZSTR_VAL(key)[0] == '\0') {
parser->scanner.errcode = PHP_JSON_ERROR_INVALID_PROPERTY_NAME;
zend_string_release_ex(key, 0);
zval_ptr_dtor_nogc(zvalue);
zval_ptr_dtor_nogc(object);
return FAILURE;
}
zend_std_write_property(Z_OBJ_P(object), key, zvalue, NULL);
Z_TRY_DELREF_P(zvalue);
} |
} | ||
|
||
// see https://github.com/simdjson/simdjson/blob/master/doc/performance.md#reusing-the-parser-for-maximum-efficiency | ||
simdjson::dom::parser parser; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This won't work in thread-safe builds of php, if multiple threads are trying to use the same parser at the same time? https://www.php.net/manual/en/pthreads.requirements.php (I'm not familiar with PHP's tsrm code, but I think it might be possible to do something with thread local storage )
- pthreads was brittle the last time I checked, assuming I'm thinking of the right extension. My concern isn't pthreads, but rather when apache is running multiple php worker threads in a single thread sharing a static variable
Initializing this in the request init (or zeroing it out and manually calling the C++ constructor) and freeing it in request shutdown might be possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some more comments, I'm assuming those comments also apply to the original pecl
It should still be possible for many web frameworks and tools to benefit from a SIMD json parser implementation in php
|
||
switch (stats) { | ||
case SIMDJSON_PARSE_FAIL: | ||
RETURN_NULL(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: inconsistent indent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, that was me :)
simdjson::dom::element doc; | ||
auto error = build_parsed_json_cust(doc, json, len, true, depth); | ||
if (error) { | ||
return SIMDJSON_PARSE_KEY_NOEXISTS; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems inconsistent to throw for invalid json in other methods but not here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch! (and generally, thanks for the review!)
This is great! Thanks! I don't know if the proposed OO API is up for debate, but I wish we had a single final class Json
{
public static function encode(mixed $value, int $flags = 0, int $depth = 512): string {}
public static function parse(string $json, bool $associative = false, int $depth = 512, int $flags = 0): mixed {}
public static function isValid(string $json): bool {}
public static function getKeyValue(string $json, string $key, bool $associative = false, int $depth = 512, int $flags = 0): mixed {}
public static function getKeyCount(string $json, string $key, int $depth = 512): int {}
public static function keyExists(string $json, string $key, int $depth = 512): ?bool {}
} Most of other programming languages do that. Some examples: Go:
Rust:
JavaScript:
|
Hi @javiereguiluz , Thanks for the ideas! The API is absolutely up for debate, the current PR should just be considered as a POC. :) Speaking about the unification: for now, it seems to make sense to combine the two classes, but we should also consider future use-cases (e.g. on-demand parsing). Most probably, a single |
Based on the internals discussion, I think this can be closed? Or do you plan to pursue the RFC? |
I have recently thought about closing it in this form. I'll maybe continue pursuing it later by using the proposed approach. |
This PR bundles https://github.com/crazyxman/simdjson_php with some major modifications to PHP, using the following API:
Besides being able to have a new, OO API, the underlying simdjson library (https://github.com/simdjson/simdjson) would offer quite a few new use-cases, as well as a major performance gain. Some benchmark results I got on my 2019 16" MacBook Pro:
$ ./benchmark/vendor/bin/phpbench run --report=table --group decode
$ ./benchmark/benchmark.php: