@kocsismate
Member

kocsismate commented Dec 29, 2020

This PR bundles https://github.com/crazyxman/simdjson_php into PHP with some major modifications, exposing the following API:

final class JsonParser
{
    public static function parse(string $json, bool $associative = false, int $depth = 512, int $flags = 0): mixed {}

    public static function isValid(string $json): bool {}

    public static function getKeyValue(string $json, string $key, bool $associative = false, int $depth = 512, int $flags = 0): mixed {}

    public static function getKeyCount(string $json, string $key, int $depth = 512): int {}

    public static function keyExists(string $json, string $key, int $depth = 512): ?bool {}
}

final class JsonEncoder
{
    public static function encode(mixed $value, int $flags = 0, int $depth = 512): string {}
}

Besides providing a new OO API, the underlying simdjson library (https://github.com/simdjson/simdjson) would enable quite a few new use cases, as well as a major performance gain. Some benchmark results from my 2019 16" MacBook Pro:

$ ./benchmark/vendor/bin/phpbench run --report=table --group decode

+-------------+---------------------+--------+----------+-----------+-----------+-------+
| benchmark   | subject             | groups | mem_peak | mean      | best      | diff  |
+-------------+---------------------+--------+----------+-----------+-----------+-------+
| DecodeBench | simdjsonDecodeAssoc | decode | 753,248b | 0.00912ms | 0.00880ms | 1.00x |
| DecodeBench | simdjsonDecode      | decode | 753,248b | 0.00996ms | 0.00960ms | 1.09x |
| DecodeBench | jsonDecodeAssoc     | decode | 753,248b | 0.01756ms | 0.01740ms | 1.93x |
| DecodeBench | jsonDecode          | decode | 753,248b | 0.02004ms | 0.01960ms | 2.20x |
+-------------+---------------------+--------+----------+-----------+-----------+-------+

$ ./benchmark/benchmark.php

filename             |json_decode()        |JsonParser::parse()  |JsonParser::isValid()|
-------------------- |---------------------|---------------------|---------------------|
apache_builds.json   |2.373 ms             |1.981 ms             |0.128 ms             |
canada.json          |106.118 ms           |21.523 ms            |3.792 ms             |
citm_catalog.json    |21.502 ms            |11.451 ms            |1.128 ms             |
github_events.json   |0.924 ms             |1.167 ms             |0.04 ms              |
gsoc-2018.json       |34.187 ms            |10.769 ms            |1.833 ms             |
instruments.json     |3.511 ms             |2.996 ms             |0.143 ms             |
marine_ik.json       |66.224 ms            |32.564 ms            |4.854 ms             |
mesh.json            |16.125 ms            |7.153 ms             |1.049 ms             |
mesh.pretty.json     |27.053 ms            |5.524 ms             |1.655 ms             |
numbers.json         |2.272 ms             |1.475 ms             |0.18 ms              |
random.json          |11.53 ms             |6.368 ms             |0.4 ms               |
twitter.json         |8.422 ms             |4.431 ms             |0.343 ms             |
twitterescaped.json  |9.902 ms             |4.786 ms             |0.568 ms             |
update-center.json   |10.365 ms            |5.74 ms              |0.354 ms             |

@kocsismate kocsismate added the RFC label Dec 29, 2020
@kocsismate kocsismate changed the title from "Add simdjson extension" to "Bundle ext/simdjson into core" Dec 29, 2020
Contributor

@TysonAndre TysonAndre left a comment

Also, if this RFC passes, I think it'd be more practical to provide a download link/script for jsonexamples: there's 15MB of JSON in there, and tests are typically distributed with releases (or downloaded from git).

(Similar to how gen_stubs.php downloads php-parser and validates the sha256 sum)


for (simdjson::dom::key_value_pair field : simdjson::dom::object(element)) {
zval value = create_object(field.value);
add_property_zval_ex(&obj, field.key.data(), strlen(field.key.data()), &value);
Contributor

@TysonAndre TysonAndre Dec 29, 2020

Won't this break for keys with embedded null bytes (\u0000), since strlen stops at the first NUL? Use field.key.length() instead of strlen?
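To illustrate the concern, here is a minimal standalone sketch (not code from the PR) contrasting the two ways of measuring a key. The helper names are hypothetical, and a `std::string_view` stands in for the key type, assuming it behaves like the `field.key` in the snippet above.

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <string_view>

// Length as the reviewed line computes it: strlen stops at the first NUL.
// Note: this is only safe when the viewed buffer is NUL-terminated.
std::size_t key_length_via_strlen(std::string_view key) {
    return std::strlen(key.data());
}

// Length as the view reports it: counts every byte, embedded NULs included.
std::size_t key_length_via_view(std::string_view key) {
    return key.length();
}

// Usage: the decoded key "a\u0000b" is three bytes long.
inline void demo_key_length() {
    const std::string storage("a\0b", 3);
    assert(key_length_via_strlen(storage) == 1);  // truncated at the NUL
    assert(key_length_via_view(storage) == 3);    // full key
}
```

With strlen, the keys "a\u0000b" and "a\u0000c" would both be stored as the property "a", so the later one would silently overwrite the earlier one.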

break;
case simdjson::dom::element_type::ARRAY :
zval arr;
array_init(&arr);
Contributor

You might be able to speed up the ::ARRAY case a bit more by imitating what array_values does.

Something along these lines, though the size would also need to be checked against 0xFFFFFF, because simdjson saturates it:

  /**
   * Get the size of the array (number of immediate children).
   * It is a saturated value with a maximum of 0xFFFFFF: if the value
   * is 0xFFFFFF then the size is 0xFFFFFF or greater.
   */
  inline size_t size() const noexcept;

	/* Initialize return array */
	array_init_size(arr, arrlen);
	zend_hash_real_init_packed(Z_ARRVAL_P(arr));

	/* Go through input array and add values to the return array */
	ZEND_HASH_FILL_PACKED(Z_ARRVAL_P(arr)) {
		for (simdjson::dom::element child : simdjson::dom::array(element)) {
			zval value = create_array(child);
			ZEND_HASH_FILL_ADD(&value);
		}
	} ZEND_HASH_FILL_END();

break;
case simdjson::dom::element_type::INT64 : ZVAL_LONG(&v, int64_t(element));
break;
case simdjson::dom::element_type::UINT64 : ZVAL_LONG(&v, uint64_t(element));
Contributor

For this to work as expected, you'd need to:

  1. Add extra code for 32-bit builds of PHP that converts out-of-range values to doubles (#if PHP_INT_SIZE == 4)
  2. Check whether the value is > ZEND_LONG_MAX and convert out-of-range UINT64/INT64 values to double
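The two points above could be sketched roughly as follows. This is a standalone illustration, not the extension's code: the `Long` template parameter is a hypothetical stand-in for zend_long (int64_t on 64-bit builds, int32_t on 32-bit builds), and a `std::variant` stands in for the zval that ends up holding either an integer or a double.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <variant>

// Stand-in for "PHP int or PHP float".
template <typename Long>
using Number = std::variant<Long, double>;

// UINT64 case: anything above the largest Long becomes a double.
template <typename Long>
Number<Long> from_uint64(std::uint64_t value) {
    if (value > static_cast<std::uint64_t>(std::numeric_limits<Long>::max())) {
        return static_cast<double>(value);
    }
    return static_cast<Long>(value);
}

// INT64 case: only matters when Long is narrower than 64 bits.
template <typename Long>
Number<Long> from_int64(std::int64_t value) {
    if (value < std::numeric_limits<Long>::min() ||
        value > std::numeric_limits<Long>::max()) {
        return static_cast<double>(value);
    }
    return static_cast<Long>(value);
}

// Usage: 2^64 - 1 does not fit a 64-bit signed integer, so it becomes a double.
inline void demo_number_range() {
    auto big = from_uint64<std::int64_t>(std::numeric_limits<std::uint64_t>::max());
    assert(std::holds_alternative<double>(big));
}
```

Without such a check, passing a uint64_t above the signed maximum to ZVAL_LONG would wrap around to a negative integer.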

zval v;
switch (element.type()) {
//ASCII sort
case simdjson::dom::element_type::ARRAY : ZVAL_LONG(&v, uint64_t(simdjson::dom::array(element).size()));
Contributor

Per the simdjson documentation, "if the value is 0xFFFFFF then the size is 0xFFFFFF or greater". This applies to both class array and class object.

Could this add some tests for arrays and objects with 16777216 elements?
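One possible way to handle the saturated size, sketched under assumptions: `SaturatingContainer` is a hypothetical stand-in for a simdjson array or object (it is not the library's type), and `exact_count` is an illustrative helper that trusts size() below the limit and otherwise counts by iterating.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Saturation limit quoted from the simdjson documentation.
constexpr std::size_t kSaturatedSize = 0xFFFFFF;

// Hypothetical stand-in: reports a saturated size() but can be fully iterated.
struct SaturatingContainer {
    std::vector<char> items;
    std::size_t size() const noexcept {
        return items.size() < kSaturatedSize ? items.size() : kSaturatedSize;
    }
    auto begin() const noexcept { return items.begin(); }
    auto end() const noexcept { return items.end(); }
};

// Trust size() below the limit; otherwise fall back to counting elements.
template <typename Container>
std::size_t exact_count(const Container& c) {
    const std::size_t reported = c.size();
    if (reported < kSaturatedSize) {
        return reported;
    }
    std::size_t n = 0;
    for (auto it = c.begin(); it != c.end(); ++it) {
        ++n;
    }
    return n;
}

// Usage: small containers take the fast path.
inline void demo_exact_count() {
    SaturatingContainer small{std::vector<char>(3)};
    assert(exact_count(small) == 3);
}
```

The fallback costs a full pass over the container, but only for inputs with at least 0xFFFFFF elements, so the common case stays cheap.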

@TysonAndre
Contributor

php > $x->{"\0b"} = 123;

Warning: Uncaught Error: Cannot access property started with '\0' in php shell code:1
Stack trace:
#0 {main}
  thrown in php shell code on line 1

Consider checking that the first character of a stdClass object property name (not an array key) isn't "\0" if (and only if) its length is > 0, and adding a test of the expected behavior.

// From ext/json/json_parser.y
static int php_json_parser_object_update(php_json_parser *parser, zval *object, zend_string *key, zval *zvalue)
{
	/* if JSON_OBJECT_AS_ARRAY is set */
	if (Z_TYPE_P(object) == IS_ARRAY) {
		zend_symtable_update(Z_ARRVAL_P(object), key, zvalue);
	} else {
		if (ZSTR_LEN(key) > 0 && ZSTR_VAL(key)[0] == '\0') {
			parser->scanner.errcode = PHP_JSON_ERROR_INVALID_PROPERTY_NAME;
			zend_string_release_ex(key, 0);
			zval_ptr_dtor_nogc(zvalue);
			zval_ptr_dtor_nogc(object);
			return FAILURE;
		}
		zend_std_write_property(Z_OBJ_P(object), key, zvalue, NULL);
		Z_TRY_DELREF_P(zvalue);
	}
	zend_string_release_ex(key, 0);

	return SUCCESS;
}

// see https://github.com/simdjson/simdjson/blob/master/doc/performance.md#reusing-the-parser-for-maximum-efficiency
simdjson::dom::parser parser;
Contributor

Won't this break in thread-safe builds of PHP if multiple threads try to use the same parser at the same time? https://www.php.net/manual/en/pthreads.requirements.php (I'm not familiar with PHP's TSRM code, but it might be possible to do something with thread-local storage.)

  • pthreads was brittle the last time I checked, assuming I'm thinking of the right extension. My concern isn't pthreads, but rather Apache running multiple PHP worker threads in a single process that share a static variable.

Initializing it in request init (or zeroing it out and manually calling the C++ constructor) and freeing it in request shutdown might be possible.
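As a rough sketch of the thread-local-storage idea in plain C++ (it does not use PHP's TSRM facilities, which would be the conventional route inside an extension): `ScratchParser` is a hypothetical stand-in for simdjson::dom::parser, and `thread_parser` is an illustrative accessor.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a reusable parser: it owns scratch memory that is
// overwritten on every parse, so two threads must not share one instance.
struct ScratchParser {
    std::string buffer;
    std::size_t parse(const std::string& json) {
        buffer = json;  // reuse the internal buffer between calls
        return buffer.size();
    }
};

// One parser per thread instead of one per process. The instance is created
// on first use in each thread and destroyed when that thread exits.
ScratchParser& thread_parser() {
    thread_local ScratchParser parser;
    return parser;
}

// Usage: repeated calls on the same thread reuse the same instance.
inline void demo_thread_parser() {
    assert(&thread_parser() == &thread_parser());
    assert(thread_parser().parse("{}") == 2);
}
```

This keeps the buffer-reuse benefit within each worker thread. The request init/shutdown approach mentioned above would instead bound the parser's lifetime to a single request.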

Contributor

@TysonAndre TysonAndre left a comment

Some more comments; I assume they also apply to the original PECL extension.

It should still be possible for many web frameworks and tools to benefit from a SIMD JSON parser implementation in PHP.


switch (stats) {
case SIMDJSON_PARSE_FAIL:
RETURN_NULL();
Contributor

nit: inconsistent indent

Member Author

Oops, that was me :)

simdjson::dom::element doc;
auto error = build_parsed_json_cust(doc, json, len, true, depth);
if (error) {
return SIMDJSON_PARSE_KEY_NOEXISTS;
Contributor

It seems inconsistent to throw for invalid JSON in the other methods but not here, where a parse error is reported as a missing key.
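A minimal sketch of the consistent behavior, under assumptions: `ParsedKeys` is a hypothetical stand-in for the outcome of parsing (an error flag plus the keys found), and the exception type is illustrative rather than what the extension would throw.

```cpp
#include <cassert>
#include <set>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a parse result: either an error, or the keys found.
struct ParsedKeys {
    bool error = false;
    std::set<std::string> keys;
};

// Invalid input is reported by throwing, as in the other methods, so a `false`
// return always means "valid JSON, key absent".
bool key_exists(const ParsedKeys& doc, const std::string& key) {
    if (doc.error) {
        throw std::invalid_argument("invalid JSON");
    }
    return doc.keys.count(key) > 0;
}

// Usage: a missing key in a valid document is not an error.
inline void demo_key_exists() {
    ParsedKeys ok{false, {"a", "b"}};
    assert(key_exists(ok, "a"));
    assert(!key_exists(ok, "z"));
}
```

Keeping the two outcomes separate lets callers tell a malformed payload apart from a well-formed one that lacks the key.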

Member Author

Nice catch! (and generally, thanks for the review!)

@javiereguiluz
Contributor

This is great! Thanks!

I don't know if the proposed OO API is up for debate, but I wish we had a single Json class with all the relevant methods:

final class Json
{
    public static function encode(mixed $value, int $flags = 0, int $depth = 512): string {}

    public static function parse(string $json, bool $associative = false, int $depth = 512, int $flags = 0): mixed {}

    public static function isValid(string $json): bool {}

    public static function getKeyValue(string $json, string $key, bool $associative = false, int $depth = 512, int $flags = 0): mixed {}

    public static function getKeyCount(string $json, string $key, int $depth = 512): int {}

    public static function keyExists(string $json, string $key, int $depth = 512): ?bool {}
}

Most other programming languages do that. Some examples:

Go:

json.Marshal(someVariable)
json.Unmarshal(jsonContent, &someVariable)

Rust:

let encoded = json::encode(&object).unwrap();
let decoded: TestStruct = json::decode(&encoded).unwrap();

JavaScript:

JSON.stringify(someVariable)
JSON.parse(jsonContent)

@kocsismate
Member Author

Hi @javiereguiluz,

Thanks for the ideas! The API is absolutely up for debate; the current PR should just be considered a POC. :)

Regarding the unification: for now, it seems to make sense to combine the two classes, but we should also consider future use cases (e.g. on-demand parsing). Most probably, a single Json class will be enough :)

@nikic
Member

nikic commented Mar 15, 2021

Based on the internals discussion, I think this can be closed? Or do you plan to pursue the RFC?

@kocsismate
Member Author

> Based on the internals discussion, I think this can be closed? Or do you plan to pursue the RFC?

I have recently been thinking about closing it in this form. I may continue pursuing it later, using the proposed approach.
