Cached validation #126
Conversation
Great job on the performance analysis! I'll try to get this reviewed and merged soon.

Thanks @wravery, we're using this in an embedded system and it's very sensitive to memory pressure. I have other patches in my https://github.com/profusion/cppgraphqlgen/tree/perf, but I'll wait for this one to be reviewed and merged so I can rebase and propose the other PR.
wravery
left a comment
I think a lot of the benefits of this change could be achieved just by keeping a `std::unique_ptr<ValidateExecutableVisitor>` alive in a member on the `Request`. Would you try that change as a smaller comparison? I think it would be a lot easier to reason about it that way than by passing the response to the `IntrospectionQuery` into the `Request`. If that one change is enough to get similar results, I'd prefer to do that.
BTW, depending on your scenario and how much query caching you can do, you might be able to erase the impact of validation entirely after an initial parse. The sample test case would have 0 overhead for validation after the first iteration if the `peg::ast` variable was declared outside of the loop. After validation it sets `ast.validated` to true, and it skips validation after that.
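A minimal sketch of that suggestion (the query text and the exact `resolve()` signature here are illustrative, not taken from the sample):

```cpp
// Hoist the AST out of the loop so validation runs exactly once.
auto ast = peg::parseString(R"gql({ fieldName })gql");

for (size_t i = 0; i < 100; ++i)
{
	// The first resolve() validates the query and sets ast.validated = true;
	// every later iteration reuses the same peg::ast and skips validation.
	auto result = service->resolve(nullptr, ast, "", response::Value(response::Type::Map)).get();
}
```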
include/Validation.h
Outdated
```cpp
std::shared_ptr<ValidateType> type;

ValidateArgument() = default;
ValidateArgument(std::shared_ptr<ValidateType>& type_)
```
`const std::shared_ptr<ValidateType>&`?
Alternatively, if you move the declaration of `type` to the top, the default constructor/initializer order should just do the right thing, with default values for the other 2 members.
Actually, that applies to `ValidateType` as well. You could omit the constructor overrides and use initializer syntax and default constructors for pretty much every struct.
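A sketch of what that would look like, assuming the constructors are removed (the members besides `type` are illustrative):

```cpp
// With no user-declared constructors the struct stays an aggregate, so
// default member initializers and brace initialization cover every case.
struct ValidateArgument
{
	std::shared_ptr<ValidateType> type;
	bool defaultValue = false;
	bool nonNullDefaultValue = false;
};

// Callers brace-initialize instead of calling a hand-written constructor:
// ValidateArgument argument { std::move(argumentType) };
```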
OK, will change that one.
include/Validation.h
Outdated
```cpp
std::shared_ptr<ValidateType> returnType;
ValidateTypeFieldArguments arguments;

ValidateTypeField() = default;
```
I think all of these constructors match the default compiler-generated constructors. You could just omit them.
include/Validation.h
Outdated
```diff
 bool isInputType() const;
-ValidateType getType();
+std::shared_ptr<ValidateType>&& getType();
```
Should return by value. From what I've heard, return-value optimization (RVO) works better that way.
To get move semantics you can move to a local variable (e.g. `auto result = std::move(value);`) and then return the local variable by value, and all the RVO goodness should apply.
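For example (a sketch; the member name `_type` is illustrative):

```cpp
std::shared_ptr<ValidateType> getType()
{
	// Move into a named local and return it by value: NRVO can elide the
	// copy, and at worst this is a cheap move. It also avoids the dangling
	// reference that returning std::shared_ptr<ValidateType>&& can produce.
	auto result = std::move(_type);
	return result;
}
```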
Oh really? I changed some of those to make it shorter 👀
Seriously, the more I look into C++, the more I dislike this language 😆
src/Validation.cpp
Outdated
```cpp
// This is taking advantage of the fact that during validation we can choose to execute
// unvalidated queries against the Introspection schema. This way we can use fragment
// cycles to expand an arbitrary number of wrapper types.
ast.validated = true;
```
Using the canonical `IntrospectionQuery` does have this limitation: it can only follow links to a fixed depth. It's compatible with more tools, but it does put a limit on the complexity of the schema.
I get this, but while there may be a combination of lists, non-null, and the actual type, it's pretty unusual in real life, even more so given that it would not work in any other GraphQL tool either.
Do you want me to move the non-standard recursive query here and use it?
I went ahead and merged that change as part of #131, so if you want to repeat your measurements with that version, we can see how effective it is compared to the full change in your PR. I can try to take my own before-and-after measurements, but I won't be able to compare them directly to your environment or results.
@wravery I'll spare some time to do the measurements and resolve the conflicts. I also have the … However, just keeping the visitor alive as you did will not solve all the issues: namely, it will not allow me to remove the introspection from the production server, and there is no optimized lookup for types (you keep the not-so-cheap maps, lookups, conversions to kind, and generation of type strings to make comparison simpler).

Also, not sure you've noticed, but in my code you can still get the introspection if you do not provide the values; it's backward compatible. However, in the near future I plan to generate the introspection results in code, then we can avoid the introspection query, only building the results using …

Problem is that we're still too slow, at least compared to Apollo: we're about 3x slower, and Apollo in turn is like 2-3x slower than Go implementations. So we need some more work; my initial PR is not enough.
Massif using fb4a589: it's running around 374KB, while my branch runs at 322KB. DHAT: this improved a bit compared to mine (224,575), likely because of the lazy … I'll rebase my work on top and see how both play along together (but still avoiding individual …).
Doing the introspection query all the time is hurting performance, and the schema does not change, so a single query can be done with all the fields, used to build a validation tree that is then consulted for all validations. This is the first part, moving the top-level query to be read-only. The next commits will eliminate the other `executeQuery()` calls; then a shared context will be created and hosted by the `Request` class, which can be discovered using introspection or fed using JSON (`schema.json`).
This is an incremental commit: just make use of the read-only data instead of the `release` primitives, which allows sharing the query results.
Split the fields getter and the cache/insertion into the map so they can be used later in a different way. There should be no code differences, just moving the internal branch to its own function. In the next commits this will be removed from the getter, which will become query-only since the types will all be cached beforehand.
Wouldn't you need to edit the generated files to do this? Alternatively, maybe there should be a switch for … If we modify the …
Interesting, I never tried a direct comparison with either of those. Partly that's because I've been thinking of it as filling a different niche, specifically interop with existing C++ code in a hybrid web or React Native client. Upon handing off to the JS UI code, most of the native perf concerns become less relevant; they're generally orders of magnitude faster/cheaper just being native. I do mostly desktop development, so even Electron is generally fast enough. So in your scenario, are you running just a GraphQL service on the device and handling the results elsewhere? Can you share a sample for either of those alternatives so I can see what we're up against?
This handles OBJECT, INTERFACE, UNION and INPUT_OBJECT types. It should have no behavior change, just moving code around. Minor adjustments were made to cope with the iterator return …
It should have no behavior change, just moving code around.
This uses the information being queried in the introspection and allows the fields and input fields to be processed in one go.
This is another step to split the visitor from the lookup data structures; in the future the lookup will be shared.
`ValidateExecutableVisitor` was split into a lookup data structure (`ValidationContext`) and the actual visitor. The lookup data structure is shared across requests, saving queries and processing.
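A rough sketch of the shape this split produces (method and member names are illustrative, not necessarily the ones in the PR):

```cpp
#include <memory>
#include <string_view>

struct ValidateType;

class ValidationContext
{
public:
	// Built once, from an introspection query or parsed JSON, then shared
	// read-only by every request.
	std::shared_ptr<ValidateType> getNamedType(std::string_view name) const;
};

class ValidateExecutableVisitor
{
public:
	// Each request constructs a lightweight visitor that borrows the shared
	// context instead of re-running introspection against the schema.
	explicit ValidateExecutableVisitor(std::shared_ptr<const ValidationContext> context);

private:
	const std::shared_ptr<const ValidationContext> _context;
};
```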
We do not need a map; there are only 3 well-defined names.
I was thinking a …
Yeah, this is my ultimate goal. I'm close to that in my PR (still cleaning up): first I'm working on the data structures (which I should push tomorrow or so), then I'll generate this `ValidationContext` directly in code.
Usually, yes. But in my case there is no rendering being done, just data being normalized for another device (web, Android, ...) where the GraphQL data is displayed.
Yes, this is an embedded device and it will normalize various sources as GraphQL queries. I cannot disclose much at this point (working for a customer under NDA), but we generate the resolvers to access some sources in C++. As the number of sources and properties is large and the hardware is underpowered, we ran into performance issues; that's why I'm trying to fix them.

👍 No problem, I was just hoping you already had benchmarks you could share for Apollo or Go.
@wravery pushed what I have done so far, and changed what you pointed out in the first review (commits were edited, so take a look at them again; there are no fixup commits). I've reworked the … Adding fields to inputs/objects was split into a second iteration; this way we know for sure the named types exist and will use those references. Things like … Also did some work to reduce the memory allocations, moving some of the … This PR now contains the …
Weird, on macOS/clang it's not giving that error. I'll test on Linux.
Instead of using a map with properties `name`, `kind` (string) and `ofType` (another map), use a set of custom classes with `kind` as an enumeration and `ofType` as a shared pointer. This allows much simpler handling and comparison; there's no need to serialize to a string just to make comparison simpler. We can also store references to types, know which kind it is (i.e. isInputType?), and save memory by using references, in particular for common types such as Int, Float, String... The matching types and fields are stored as part of each `ValidateType` specialization.
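A hedged sketch of the idea (the PR's real `ValidateType` is richer; this flattens it for brevity):

```cpp
#include <memory>
#include <string_view>

enum class TypeKind { Scalar, Object, Interface, Union, Enum, InputObject, List, NonNull };

struct ValidateType
{
	TypeKind kind;                        // enumeration instead of a "kind" string
	std::string_view name;                // empty for LIST / NON_NULL wrappers
	std::shared_ptr<ValidateType> ofType; // wrapper types chain here

	// Structural comparison, no string serialization required.
	bool operator==(const ValidateType& other) const
	{
		if (kind != other.kind || name != other.name)
		{
			return false;
		}

		return ofType == other.ofType
			|| (ofType && other.ofType && *ofType == *other.ofType);
	}
};
```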
Instead of 2 maps + a set (both ordered), use one single `unordered_map` (keyed by `string_view`) + an `unordered_set` (pointer to definition). The `string_view` is okay since the `ast_node` tree is valid during the processing, so the references are valid. The pointer to the definition is also okay for `_referencedVariables`, since the definitions are all created upfront and the map (thus the references) won't change while visiting the fields. The 2->1 map reduction was possible since we're now storing the definition location instead of using a second map just to store the `ast_node` used to query the position on errors.
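Roughly (the value types are illustrative; the stored definition node doubles as the error position, which is what removes the second map):

```cpp
#include <string_view>
#include <unordered_map>
#include <unordered_set>

namespace peg { struct ast_node; } // stand-in for the parser's node type

// Visitor members (sketch). string_view keys are safe because the ast_node
// tree outlives the whole validation pass, so the views never dangle.
std::unordered_map<std::string_view, const peg::ast_node*> _variableDefinitions;
std::unordered_set<const peg::ast_node*> _referencedVariables;
```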
Pre-allocate a vector and populate it, then iterate directly instead of using a queue.
Instead of always creating a recursive resolver, which in turn may call `std::async()`, only do that if the result is not readily available. The wrap is not free and is, more often than not, useless.
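The readiness check could be as simple as this sketch (the helper name is assumed):

```cpp
#include <chrono>
#include <future>

template <typename T>
bool isReady(const std::shared_future<T>& value)
{
	// A zero-length wait tells us whether the value can be fetched without
	// blocking; only when it can't is the async wrapper worth paying for.
	return value.wait_for(std::chrono::seconds(0)) == std::future_status::ready;
}
```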
Just minor tweaks to make it compile: moving the template functions to the header and marking virtuals as final. In the next commits it will be moved to more public usage, including the generator.
Soon there will be a generated `ValidationContext`, so we won't need to carry any of the introspection bits.
This is basically moving code around, changing the parameters to allow the request to receive the `ValidationContext`. Bring back the original `Request()` constructor so it will not break dynamically linked binaries. The introspection results are kept around; in the future the validation context will only use pointers to strings (`string_view`), and everything must stay alive in order to work.
@wravery finally got everything to work 😅 It became HUGE; the commits are not in the best order possible, as I fixed some issues while I was reading/measuring the code paths. The code generator now outputs … You can skim through the commit messages to see all that was changed, but a summary is: …

DHAT is impressive: 125,212,274 -> 9,472,310 (max), with half of the memory used by … This massif chart should give you a clear picture of the final results: there is no peak anymore and memory is stable at ~300KB.
Damn, the test failed on CI, related to …
This avoids the introspection query and simplifies the build of the lookup maps.
The generated file contains `#ifdef SCHEMAGEN_DISABLE_INTROSPECTION`; if that is set, the introspection blocks will be disabled:
- no `__schema` and `__type` resolvers
- no `AddTypesToSchema`
- no `_schema` field
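An illustrative shape of the guarded output (the macro name comes from the commit message; the surrounding generated code is assumed):

```cpp
#ifndef SCHEMAGEN_DISABLE_INTROSPECTION
	// The __schema / __type resolvers and the introspection type
	// registration only exist when introspection support is compiled in.
	introspection::AddTypesToSchema(schema);
#endif // SCHEMAGEN_DISABLE_INTROSPECTION
```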
@wravery now it includes all the introspection stuff (I did a complete generation; even if the current schema doesn't contain any enums or input types, if we add those in the future the generator will just work). It also includes …
Running the sample without introspection, the results are: DHAT 125,212,274 -> 6,522,087; 49% of the allocated memory is in graphqlpeg, followed by … Massif reports less than half the memory used: …
Provide `push()` and `pop()` convenience methods so it's the same as `queue`. `list.size()` is not as fast (it walks the list counting the elements), however these lists are often small enough for that not to matter.
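A hedged sketch of those wrappers (the actual container and names in the PR may differ):

```cpp
#include <list>
#include <utility>

template <typename T>
struct FieldQueue
{
	std::list<T> values;

	// Mirror the std::queue interface: push at the back...
	void push(T value) { values.push_back(std::move(value)); }

	// ...and pop from the front (returning the value for convenience).
	T pop()
	{
		T value = std::move(values.front());
		values.pop_front();
		return value;
	}
};
```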
Change `SelectionSetParams` to keep an optional reference to the parent; this way we don't need to build the path over and over again just to add one element, and instead create it on demand. Since `errorPath` was accessed directly, this breaks the existing code: it became a method that dynamically computes the error path (recursively). This is important since we only pay the list-copy price when there is an error, not on every field resolution.
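A hedged sketch of the on-demand path (the real `SelectionSetParams` carries much more state, and the segment type is simplified to a field name here):

```cpp
#include <list>
#include <string_view>

struct PathSegment
{
	std::string_view fieldName;          // this level's field (or list index)
	const PathSegment* parent = nullptr; // link to the parent instead of a copied list
};

// Build the full path only when an error is actually reported, so ordinary
// field resolution never pays the list-copy price.
std::list<std::string_view> buildErrorPath(const PathSegment* segment)
{
	std::list<std::string_view> path;

	for (; segment != nullptr; segment = segment->parent)
	{
		path.push_front(segment->fieldName);
	}

	return path;
}
```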
Use this specific constructor in the list converter, creating one `itemParams` with the new `ownErrorPath`, instead of changing the wrapper request param.
We shouldn't modify the parameters; using a string causes it to copy the field name, which is particularly bad when processing huge lists (it would copy the name for each item).
We don't change it anymore and we don't push to the array, so we can keep it inline in the parent structure, avoiding the extra allocation.
@wravery with this last commit, peg is 68% of the allocation; everything else runs much smoother. I'm running out of time to work on more optimizations, but if you know how to get peg to play nicer with memory, let me know.

Sounds good. I'm going to make a cleanup pass to make sure it's consistent with the rest of the project, and then I should be able to get it merged sometime this week. Thanks for this contribution!
I forgot to reply to this one. I can't share many details due to the NDA, but raw numbers (C++ runs with …):
"Pre Optimizations" is 3add6d3, BEFORE your cached validation visitor. By far that was the biggest source of slowness. Just that helped a lot, however other changes like changing the converters to be more efficient, remove some useless This test is an artificially generated schemas, one is deeply nested, the other is a huge flat schema. We're querying 50 leaf fields, 500 queries over the network (HTTP/GET), using websocketpp in the C++ version. We don't have it written in Go to say for sure, but given this https://github.com/appleboy/golang-graphql-benchmark and https://github.com/the-benchmarker/graphql-benchmarks/blob/develop/rates.md we can estimate how slow JS is compared to Go. |
|
While reviewing this, I thought of another approach that I'd like to take. Rather than building a separate …

TL;DR: I'm not driving the validation through a separate hierarchy of cheaper validation objects, I'm driving both validation and introspection through a separate hierarchy of cheaper introspection objects. Also, instead of using pre-processor directives, I added a …

I have a little more cleanup on this approach to go, and I want to try to rebase or merge some of your other fixes on top of that, but the memory savings seem very promising. I also noticed my unit tests run about 2-3 times faster 🎉! Here's what I got from debug builds running on WSL 2 (using a …):

- massif against master
- massif against this PR

I re-ran this after confirming my copy of this branch was up-to-date, and I'm still getting very different results from your last update, more in line with the previous update:

- massif against my version
- massif against my version with `--no-introspection`

This is generating the compact schema representation, but it blocks loading the …

The branch where I'm working on this is https://github.com/wravery/cppgraphqlgen/tree/merge-cached-validation. If you want to run any of your own tests on that, you can.

Remaining work: …
@wravery that's okay, just be sure to compare to my latest version, since you report … As for the …

Meanwhile … was triggered by the ast stuff; as I never worked with PEGTL it took me a while, but https://github.com/profusion/cppgraphqlgen/compare/cached-validation...profusion:parser-tweaks is evolving. What is left is a way to cache the …

Do you know of some simple way to get this memory arena/allocator pool done? Currently PEGTL generates A LOT of useless nodes; every node is first built only to be discarded later, which causes too much pressure on the memory allocator. So far, without the arena/pool, it removes around 10KB from the peak:
Got it. I made a separate … And this is … Almost all of the savings apply with or without …
Interesting, so it looks like this is meant to ignore the schema definition part of the grammar, correct? That makes sense for most purposes, and the same optimization could be used in reverse for the schema generator. However, the validation section of the spec specifically mentions rejecting documents for execution if they have any non-executable elements (and vice versa for schema definitions, IIRC). This will convert an error about that specifically into a parser error. If you split the document rules into separate executable and schema documents, no single document should satisfy both, but it might be nice to add a handler for the parse error which checks whether it matches the unified document grammar and converts that to the same error message about how they shouldn't be mixed together. It should even be possible to parse the grammar and see if it matches without executing any of the actions, so parsing against the unified … I would also swap the meaning of the two …
I think you can still inherit from … For simplicity, high throughput, and data locality, I would suggest using a `std::deque` for the node mempool:
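The original snippet wasn't preserved in this thread, so here's a reconstructed sketch of the idea (every name is an assumption):

```cpp
#include <deque>
#include <vector>

template <typename Node>
class NodeMemPool
{
public:
	// std::deque allocates fixed-size chunks, so most adjacent nodes share
	// cache lines and the allocator is hit once per chunk, not once per node.
	Node* allocate()
	{
		if (!_recycled.empty())
		{
			Node* node = _recycled.back();
			_recycled.pop_back();
			return node;
		}

		return &_nodes.emplace_back();
	}

	// Discarded parse nodes are reset and recycled instead of freed, which
	// suits a parser that builds many short-lived nodes.
	void release(Node* node)
	{
		*node = Node {};
		_recycled.push_back(node);
	}

private:
	std::deque<Node> _nodes;      // chunked storage with stable addresses
	std::vector<Node*> _recycled; // free list of reusable nodes
};
```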
The tradeoff is allocating a few extra nodes (based on the chunk size) to round up, but if they're empty that's not a lot, and it will greatly decrease the number of calls to the allocator, in addition to making most adjacent nodes fit together in a cache line. Since the actual parsing is single-threaded, but parsing might happen concurrently on multiple threads, you might try adding the …
More random thoughts about the mempool: …
There are a lot of ways to misuse … Generally, replacing … A couple of other points about `std::string_view`: …
Never mind, this was not as efficient as I hoped. The overall memory usage is higher and it's a little slower. I was able to plug in a …
I tested that hack as well, but only quickly, without checking the number of entries. What I found earlier with some prints is that the quick alloc-dealloc pairs were hitting the malloc cache, so it was not going to the kernel. However, that reuse improved only once I reduced the …
Introduction
cppgraphqlgen validates each query against a schema before it's executed.

Most tools (e.g. Apollo) use a given schema to work on; most will load it using a JSON file resulting from an introspection query (usually called `schema.json`). However, `cppgraphqlgen` does this by running an introspection query prior to executing each query. This is bad for two reasons: … In particular the last point is hurting us, since for large schemas (or small user queries) the cost of each introspection is way larger than the user query itself, so the validation takes more time than the execution.
This PR moves the schema information used to do validation into a class `ValidationContext` that is hosted by `graphql::service::Request` and shared with each `ValidateExecutableVisitor` used. The `ValidationContext` can be created using an introspection query on the service, or from a `response::Value` with the results of such a query, which is useful for people using `parseJSON()`. In addition to that, `ValidateType` is a specific structure instead of the more expensive `response::Value`; this allows lower memory usage and faster/simpler handling due to direct access to `kind` (as an enum), `name`, and `ofType` (as a shared pointer).

Test Environment
The following tests were executed on Ubuntu 20.04.1 LTS (Focal Fossa) running in Docker on macOS, compiled with GCC 9.3.0 and running kernel 5.4.39-linuxkit.
The test was done using a loop of 100 iterations inside `samples/today/sample.cpp`. The input query: …
valgrind --tool=massif
Valgrind provides a heap profiler called Massif.
As seen below in the `ms_print` results, the introspection query costs much more than the actual query (since it's small). If we keep redoing the introspection on each query, we keep up the memory pressure. This may also lead to fragmentation that causes more memory to be used, as seen in the final snapshot of each.

Notice there is a higher peak using Cached, since the introspection is handled as read-only (values are not released as the `ValidationContext` is built). This was done to enable the `response::Value` to be used elsewhere, as well as being loaded from `parseJSON()`. However, after the peak (695,248) it goes down (324,736) and remains mostly stable, around 50KB smaller than the Pristine solution.

The results were edited to present only the most relevant information.
Pristine results (commit: 3add6d3)
Cached Validation Results
valgrind --tool=dhat
Valgrind provides a dynamic heap analysis tool called DHAT.
As seen in the results from `dh_view.html`, while we have a peak (t-gmax: 629KB) as explained under `valgrind --tool=massif`, we use far fewer bytes, with far fewer reads and writes (total), and increased memory utilization (reads/writes are greater than total): … That's 22 times less memory used and around 15 times less memory being read and written.
Comparing the most important allocation origins:
- … 27 times better.
- … 28 times better.
- … 28 times better.
- … 28 times better.
- … 17 times better.
The results were edited to present only the most relevant information.
Pristine results (commit: 3add6d3)
Cached Validation Results