
Conversation

@barbieri
Contributor

@barbieri barbieri commented Dec 2, 2020

Introduction

cppgraphqlgen validates each query against the schema before it is executed.

Most tools (e.g. Apollo) work from a given schema; most will load it from a JSON file produced by an introspection query (usually called schema.json).

However, cppgraphqlgen does this by running an introspection query before executing each query. This is bad for two reasons:

  1. it requires the server to have introspection enabled, which may not be desired in production environments.
  2. the schema is constant, so validation should reuse it, avoiding the introspection query on every user query.

In particular the last point is hurting us: for large schemas (or small user queries) the cost of each introspection is far larger than the user query itself, so validation takes more time than execution.

This PR moves the schema information used for validation into a class ValidationContext that is hosted by graphql::service::Request and shared with each ValidateExecutableVisitor.

The ValidationContext can be created by running an introspection query on the service, or from a response::Value holding the results of such a query, which is useful for people using parseJSON().

In addition, ValidateType is a dedicated structure instead of the more expensive response::Value. This reduces memory usage and is faster and simpler to use thanks to direct access to kind (as an enum), name and ofType (as a shared pointer).

Test Environment

The following tests were executed on Ubuntu 20.04.1 LTS (Focal Fossa) running in Docker on macOS.

Compiled with GCC 9.3.0 and running kernel 5.4.39-linuxkit.

The test was done using a loop of 100 iterations inside samples/today/sample.cpp:

diff --git a/samples/today/sample.cpp b/samples/today/sample.cpp
index 376a730..cb5e2b1 100644
--- a/samples/today/sample.cpp
+++ b/samples/today/sample.cpp
@@ -59,6 +59,7 @@ int main(int argc, char** argv)
 
        std::cout << "Created the service..." << std::endl;
 
+       for (int i=0; i < 100; i++) {
        try
        {
                peg::ast query;
@@ -97,6 +98,7 @@ int main(int argc, char** argv)
                std::cerr << ex.what() << std::endl;
                return 1;
        }
+       }
 
        return 0;
 }

The input query:

query {
    appointments {
        pageInfo { hasNextPage }
        edges {
            node {
                id
                when
                subject
                isNow
            }
        }
    }
}

valgrind --tool=massif

Valgrind provides a heap profiler called Massif.
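For reference, the graphs and snapshot tables below can be reproduced with an invocation along these lines (the binary path is illustrative; use wherever your build places the today sample):

```shell
# Record a heap profile of the sample run (writes massif.out).
valgrind --tool=massif --massif-out-file=massif.out ./sample
# Render the ASCII graph and snapshot table from the profile.
ms_print massif.out
```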

As seen in the ms_print results below, the introspection query costs much more than the actual query (which is small). Redoing the introspection on each query keeps up memory pressure, and may also lead to fragmentation that causes more memory to be used, as seen in the final snapshot of each run:

Code      Initial (B)  Final (B)  Peak (B)
Pristine      413,232    372,936   413,232
Cached        273,536    322,088   695,248

Notice the higher peak with Cached: the introspection response is treated as read-only (its values are not released while the ValidationContext is built). This was done so the response::Value can be used elsewhere, as well as loaded from parseJSON().

However, after the peak (695,248) usage drops (324,736) and remains mostly stable, around 50 KB smaller than the Pristine solution.

The results were edited to present only the most relevant information.

Pristine results (commit: 3add6d3)

    KB
403.5^ #
     | #           :                                   ::  :    :          ::
     | #::         :   ::::  ::   :         ::       : : :::    :  :: @: : ::@
     | #::   ::  ::::::::: :@:::  :  : @:: :::::     ::: : ::   ::::::@::: ::@
     | #:::::: : : ::: ::: :@:::  :  ::@:  ::::   :: ::: : ::   :: :::@::: ::@
     | #::: :: : : ::: ::: :@:::  :: ::@:  ::::   :  ::: : ::  ::: :::@::: ::@
     | #::: :: : : ::: ::: :@:::  :: ::@: ::::: ::: :::: : ::  ::: :::@::: ::@
     | #::: :: ::: ::: ::: :@::::::::::@: ::::: : : :::: : ::::::: :::@::::::@
     | #::: :: ::: ::: ::: :@:::: :::::@: ::::: : : :::: : ::: ::: :::@::::::@
     | #::: :: ::: ::: ::: :@:::: :::::@: ::::: : : :::: : ::: ::: :::@::::::@
     | #::: :: ::: ::: ::: :@:::: :::::@: ::::: : : :::: : ::: ::: :::@::::::@
     | #::: :: ::: ::: ::: :@:::: :::::@: ::::: : : :::: : ::: ::: :::@::::::@
     | #::: :: ::: ::: ::: :@:::: :::::@: ::::: : : :::: : ::: ::: :::@::::::@
     | #::: :: ::: ::: ::: :@:::: :::::@: ::::: : : :::: : ::: ::: :::@::::::@
     | #::: :: ::: ::: ::: :@:::: :::::@: ::::: : : :::: : ::: ::: :::@::::::@
     | #::: :: ::: ::: ::: :@:::: :::::@: ::::: : : :::: : ::: ::: :::@::::::@
     | #::: :: ::: ::: ::: :@:::: :::::@: ::::: : : :::: : ::: ::: :::@::::::@
     | #::: :: ::: ::: ::: :@:::: :::::@: ::::: : : :::: : ::: ::: :::@::::::@
     | #::: :: ::: ::: ::: :@:::: :::::@: ::::: : : :::: : ::: ::: :::@::::::@
     | #::: :: ::: ::: ::: :@:::: :::::@: ::::: : : :::: : ::: ::: :::@::::::@
   0 +----------------------------------------------------------------------->Gi
     0                                                                   11.34

Number of snapshots: 65
 Detailed snapshots: [1 (peak), 18, 28, 53, 63]

--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
  0              0                0                0             0            0
  1    247,399,635          413,232          382,134        31,098            0

...

 54 11,041,472,348          374,496          349,680        24,816            0
 55 11,155,054,291          359,776          336,956        22,820            0
 56 11,268,636,351          367,192          344,611        22,581            0
 57 11,382,217,337          380,120          357,087        23,033            0
 58 11,495,799,440          317,968          300,831        17,137            0
 59 11,609,381,341          275,864          262,568        13,296            0
 60 11,722,965,210          394,488          365,022        29,466            0
 61 11,836,546,650          372,120          345,123        26,997            0
 62 11,950,130,852          399,536          370,048        29,488            0
 63 12,063,713,851          380,104          352,716        27,388            0
 64 12,177,294,258          372,936          346,969        25,967            0

Cached Validation Results

    KB
679.0^                            #
     |                           :#
     |                       @@:::#
     |                    @@:@ :::#
     |                  ::@@:@ :::#
     |             ::@::::@@:@ :::#
     |             ::@ :::@@:@ :::#
     |       @ ::::::@ :::@@:@ :::#
     |     ::@:::: ::@ :::@@:@ :::#
     |   ::::@:::: ::@ :::@@:@ :::#
     |  :: ::@:::: ::@ :::@@:@ :::#
     |  :: ::@:::: ::@ :::@@:@ :::#:@::::::::::@:::::::@:::@::::::@::::::@::::
     | ::: ::@:::: ::@ :::@@:@ :::#:@::::::::: @:::::: @: :@::::::@::::::@::::
     |@::: ::@:::: ::@ :::@@:@ :::#:@::::::::: @:::::: @: :@::::::@::::::@::::
     |@::: ::@:::: ::@ :::@@:@ :::#:@::::::::: @:::::: @: :@::::::@::::::@::::
     |@::: ::@:::: ::@ :::@@:@ :::#:@::::::::: @:::::: @: :@::::::@::::::@::::
     |@::: ::@:::: ::@ :::@@:@ :::#:@::::::::: @:::::: @: :@::::::@::::::@::::
     |@::: ::@:::: ::@ :::@@:@ :::#:@::::::::: @:::::: @: :@::::::@::::::@::::
     |@::: ::@:::: ::@ :::@@:@ :::#:@::::::::: @:::::: @: :@::::::@::::::@::::
     |@::: ::@:::: ::@ :::@@:@ :::#:@::::::::: @:::::: @: :@::::::@::::::@::::
   0 +----------------------------------------------------------------------->Mi
     0                                                                   547.7

Number of snapshots: 78
 Detailed snapshots: [1, 7, 14, 19, 20, 22, 26 (peak), 28, 38, 46, 50, 60, 70]

--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
  0              0                0                0             0            0
  1      6,099,963          273,536          260,678        12,858            0
  2     13,630,792          303,472          286,886        16,586            0
  3     20,540,448          364,544          343,318        21,226            0
  4     29,108,693          390,576          365,894        24,682            0
  5     39,953,980          418,056          390,071        27,985            0
  6     49,004,409          419,864          391,297        28,567            0
  7     60,393,483          453,360          421,343        32,017            0
  8     70,792,208          438,912          406,857        32,055            0
  9     79,450,974          470,392          435,255        35,137            0
 10     86,750,721          462,080          426,707        35,373            0
 11     94,428,296          480,288          442,930        37,358            0
 12    105,685,674          522,360          480,294        42,066            0
 13    114,027,797          530,872          487,348        43,524            0
 14    120,219,351          544,840          499,775        45,065            0
 15    127,206,286          549,008          502,891        46,117            0
 16    136,595,824          554,976          507,668        47,308            0
 17    147,169,402          565,392          515,942        49,450            0
 18    153,538,644          566,696          516,510        50,186            0
 19    162,028,276          593,160          540,231        52,929            0
 20    172,315,488          613,064          557,830        55,234            0
 21    179,148,247          600,184          544,860        55,324            0
 22    189,592,002          632,832          574,212        58,620            0
 23    201,832,220          648,744          587,692        61,052            0
 24    210,197,711          654,488          591,920        62,568            0
 25    215,645,861          668,456          604,413        64,043            0
 26    223,665,287          695,248          625,625        69,623            0
 27    231,374,858          324,736          304,885        19,851            0

...

 70    536,166,008          322,616          303,149        19,467            0
 71    541,614,660          328,928          308,747        20,181            0
 72    547,064,031          324,880          305,069        19,811            0
 73    552,514,315          339,880          318,987        20,893            0
 74    557,967,268          326,800          306,789        20,011            0
 75    563,416,310          339,264          318,226        21,038            0
 76    568,866,536          322,088          302,485        19,603            0
 77    574,314,700          293,632          277,700        15,932            0

valgrind --tool=dhat

Valgrind provides a dynamic heap analysis tool called DHAT.

As seen in the results from dh_view.html, while we have a peak (t-gmax: 629 KB) as explained in the valgrind --tool=massif section, we use far fewer bytes, with far fewer reads and writes (total), and better memory utilization (reads/writes greater than total):

Code         Used (B)     Read (B)    Write (B)
Pristine  578,735,537  418,677,007  422,701,596
Cached     25,788,030   26,349,575   26,274,210

That's 22 times less memory used and around 15 times less memory read and written.

Comparing most important allocation origins:

graphql::service::SelectionSetParams::SelectionSetParams(...) (GraphQLService.h:144)
graphql::service::ResolverParams::ResolverParams(...) (GraphQLService.h:233)
Code      AP             Total (B)  Max (B)
Pristine  1.1.2.1.1     87,456,000    3,360
Cached    1.1.4.1.1.1    3,161,280    2,880

27 times better.


graphql::service::SelectionVisitor::visitField(...) (GraphQLService.cpp:1044)
graphql::service::SelectionVisitor::visit(...) (GraphQLService.cpp:940)
Code      AP             Total (B)  Max (B)
Pristine  1.1.2.1.2     29,664,000      480
Cached    1.1.4.1.1.2    1,044,960      480

28 times better.


graphql::service::SelectionSetParams::SelectionSetParams(...) (GraphQLService.h:144)
graphql::service::ResolverParams::ResolverParams(...) (GraphQLService.cpp:504)
graphql::service::SelectionVisitor::visitField(...) (GraphQLService.cpp:1048)
Code      AP               Total (B)  Max (B)
Pristine  1.1.2.2.1.1     29,664,000    4,800
Cached    1.1.4.1.2.1.1    1,049,280   10,080

28 times better.


graphql::service::SelectionSetParams::SelectionSetParams(...) (GraphQLService.h:144)
graphql::service::FieldParams::FieldParams(...) (GraphQLService.cpp:164)
Code      AP               Total (B)
Pristine  1.1.2.2.1.2     29,376,000
Cached    1.1.4.1.2.1.2    1,048,800

28 times better.

Note: no Max since it's a block with only insignificant children.


graphql::service::SelectionVisitor::visitField(...) (GraphQLService.cpp:1031)
graphql::service::SelectionVisitor::visit(...) (GraphQLService.cpp:940)
graphql::service::Object::resolve(...) (GraphQLService.cpp:1241)
Code      AP             Total (B)
Pristine  1.1.2.2.2.1   17,712,000
Cached    1.1.4.1.2.2    1,044,960

17 times better.

Note: no Max since it's a block with only insignificant children.


The results were edited to present only the most relevant information.

Pristine results (commit: 3add6d3)

Times {
  t-gmax: 247,395,291 instrs (2.02% of program duration)
  t-end:  12,222,351,721 instrs
}

▼ AP 1/1 (2 children) {
    Total:     578,735,537 bytes (100%, 47,350.59/Minstr) in 3,342,560 blocks (100%, 273.48/Minstr), avg size 173.14 bytes, avg lifetime 7,428,371.41 instrs (0.06% of program duration)
    Reads:     418,677,007 bytes (100%, 34,255.03/Minstr), 0.72/byte
    Writes:    422,701,596 bytes (100%, 34,584.31/Minstr), 0.73/byte
  ├─▼ AP 1.1/2 (14 children) {
  │     Total:     578,582,641 bytes (99.97%, 47,338.08/Minstr) in 3,342,355 blocks (99.99%, 273.46/Minstr), avg size 173.11 bytes, avg lifetime 7,410,581.76 instrs (0.06% of program duration)
  │     Reads:     418,520,722 bytes (99.96%, 34,242.24/Minstr), 0.72/byte
  │     Writes:    422,586,724 bytes (99.97%, 34,574.91/Minstr), 0.73/byte
  │   ├─▶ AP 1.1.2/14 (3 children) {
  │   │     Total:     234,816,000 bytes (40.57%, 19,212.01/Minstr) in 489,200 blocks (14.64%, 40.03/Minstr), avg size 480 bytes, avg lifetime 399,785.49 instrs (0% of program duration)
  │   │     Reads:     51,090,800 bytes (12.2%, 4,180.11/Minstr), 0.22/byte
  │   │     Writes:    31,157,500 bytes (7.37%, 2,549.22/Minstr), 0.13/byte
  │   │   ├─▼ AP 1.1.2.1/3 (3 children) {
  │   │   │     Total:     117,264,000 bytes (20.26%, 9,594.23/Minstr) in 244,300 blocks (7.31%, 19.99/Minstr), avg size 480 bytes, avg lifetime 71,559.19 instrs (0% of program duration)
  │   │   │   ├── AP 1.1.2.1.1/3 {
  │   │   │   │     Total:     87,456,000 bytes (15.11%, 7,155.42/Minstr) in 182,200 blocks (5.45%, 14.91/Minstr), avg size 480 bytes, avg lifetime 82,661.48 instrs (0% of program duration)
  │   │   │   │     Max:       3,360 bytes in 7 blocks, avg size 480 bytes
  │   │   │   │       #11: 0x194294: graphql::service::SelectionSetParams::SelectionSetParams(...) (GraphQLService.h:144)
  │   │   │   │       #12: 0x1942D2: graphql::service::ResolverParams::ResolverParams(...) (GraphQLService.h:233)
  │   │   │   ├── AP 1.1.2.1.2/3 {
  │   │   │   │     Total:     29,664,000 bytes (5.13%, 2,427.03/Minstr) in 61,800 blocks (1.85%, 5.06/Minstr), avg size 480 bytes, avg lifetime 39,097.38 instrs (0% of program duration)
  │   │   │   │     Max:       480 bytes in 1 blocks, avg size 480 bytes
  │   │   │   │       #11: 0x29E4F4: graphql::service::SelectionVisitor::visitField(...) (GraphQLService.cpp:1044)
  │   │   │   │       #12: 0x29DABA: graphql::service::SelectionVisitor::visit(...) (GraphQLService.cpp:940)
  │   │   ├─▼ AP 1.1.2.2/3 (3 children) {
  │   │   │     Total:     117,216,000 bytes (20.25%, 9,590.3/Minstr) in 244,200 blocks (7.31%, 19.98/Minstr), avg size 480 bytes, avg lifetime 687,649.65 instrs (0.01% of program duration)
  │   │   │     Reads:     51,090,800 bytes (12.2%, 4,180.11/Minstr), 0.44/byte
  │   │   │     Writes:    31,157,500 bytes (7.37%, 2,549.22/Minstr), 0.27/byte
  │   │   │   ├─▼ AP 1.1.2.2.1/3 (3 children) {
  │   │   │   │     Total:     65,856,000 bytes (11.38%, 5,388.16/Minstr) in 137,200 blocks (4.1%, 11.23/Minstr), avg size 480 bytes, avg lifetime 1,160,216.45 instrs (0.01% of program duration)
  │   │   │   │     Reads:     27,983,600 bytes (6.68%, 2,289.54/Minstr), 0.42/byte
  │   │   │   │     Writes:    18,299,300 bytes (4.33%, 1,497.2/Minstr), 0.28/byte
  │   │   │   │   ├── AP 1.1.2.2.1.1/3 {
  │   │   │   │   │     Total:     29,664,000 bytes (5.13%, 2,427.03/Minstr) in 61,800 blocks (1.85%, 5.06/Minstr), avg size 480 bytes, avg lifetime 814,831.01 instrs (0.01% of program duration)
  │   │   │   │   │     Max:       4,800 bytes in 10 blocks, avg size 480 bytes
  │   │   │   │   │     Reads:     17,508,200 bytes (4.18%, 1,432.47/Minstr), 0.59/byte
  │   │   │   │   │     Writes:    8,622,100 bytes (2.04%, 705.44/Minstr), 0.29/byte
  │   │   │   │   │       ^10: 0x1B6946: graphql::service::SelectionSetParams::SelectionSetParams(...) (GraphQLService.h:144)
  │   │   │   │   │       #11: 0x29B195: graphql::service::ResolverParams::ResolverParams(...) (GraphQLService.cpp:504)
  │   │   │   │   │       #12: 0x29E5C6: graphql::service::SelectionVisitor::visitField(...) (GraphQLService.cpp:1048)
  │   │   │   │   ├─▼ AP 1.1.2.2.1.2/3 (1 children) {
  │   │   │   │   │     Total:     29,376,000 bytes (5.08%, 2,403.47/Minstr) in 61,200 blocks (1.83%, 5.01/Minstr), avg size 480 bytes, avg lifetime 11,348.13 instrs (0% of program duration)
  │   │   │   │   │     Reads:     6,589,400 bytes (1.57%, 539.13/Minstr), 0.22/byte
  │   │   │   │   │     Writes:    8,276,600 bytes (1.96%, 677.17/Minstr), 0.28/byte
  │   │   │   │   │       ^10: 0x1B6946: graphql::service::SelectionSetParams::SelectionSetParams(...) (GraphQLService.h:144)
  │   │   │   │   │       #11: 0x299141: graphql::service::FieldParams::FieldParams(...) (GraphQLService.cpp:164)
  │   │   │   ├─▼ AP 1.1.2.2.2/3 (2 children) {
  │   │   │   │     Total:     29,664,000 bytes (5.13%, 2,427.03/Minstr) in 61,800 blocks (1.85%, 5.06/Minstr), avg size 480 bytes, avg lifetime 45,409.13 instrs (0% of program duration)
  │   │   │   │     Reads:     13,660,500 bytes (3.26%, 1,117.67/Minstr), 0.46/byte
  │   │   │   │     Writes:    8,636,900 bytes (2.04%, 706.65/Minstr), 0.29/byte
  │   │   │   │   ├── AP 1.1.2.2.2.1/2 {
  │   │   │   │   │     Total:     17,712,000 bytes (3.06%, 1,449.15/Minstr) in 36,900 blocks (1.1%, 3.02/Minstr), avg size 480 bytes, avg lifetime 43,341.1 instrs (0% of program duration)
  │   │   │   │   │       ^10: 0x29E3FC: graphql::service::SelectionVisitor::visitField(...) (GraphQLService.cpp:1031)
  │   │   │   │   │       ^11: 0x29DABA: graphql::service::SelectionVisitor::visit(...) (GraphQLService.cpp:940)
  │   │   │   │   │       #12: 0x2A0F3F: graphql::service::Object::resolve(...) (GraphQLService.cpp:1241)


AP significance threshold: total >= 33,425.6 blocks (1%)

Cached Validation Results

Times {
  t-gmax: 224,143,986 instrs (38.98% of program duration)
  t-end:  575,074,817 instrs
}

▼ AP 1/1 (2 children) {
    Total:     25,788,030 bytes (100%, 44,842.91/Minstr) in 152,884 blocks (100%, 265.85/Minstr), avg size 168.68 bytes, avg lifetime 8,901,103.72 instrs (1.55% of program duration)
    Reads:     26,349,575 bytes (100%, 45,819.39/Minstr), 1.02/byte
    Writes:    26,274,210 bytes (100%, 45,688.33/Minstr), 1.02/byte
  ├─▼ AP 1.1/2 (14 children) {
  │     Total:     25,635,134 bytes (99.41%, 44,577.04/Minstr) in 152,679 blocks (99.87%, 265.49/Minstr), avg size 167.9 bytes, avg lifetime 8,897,176.13 instrs (1.55% of program duration)
  │     Reads:     26,193,290 bytes (99.41%, 45,547.62/Minstr), 1.02/byte
  │     Writes:    26,159,338 bytes (99.56%, 45,488.58/Minstr), 1.02/byte
  │   ├─▼ AP 1.1.4/14 (2 children) {
  │   │     Total:     8,493,120 bytes (32.93%, 14,768.72/Minstr) in 17,694 blocks (11.57%, 30.77/Minstr), avg size 480 bytes, avg lifetime 443,031.55 instrs (0.08% of program duration)
  │   │     Reads:     1,822,343 bytes (6.92%, 3,168.88/Minstr), 0.21/byte
  │   │     Writes:    1,118,090 bytes (4.26%, 1,944.25/Minstr), 0.13/byte
  │   │   ├─▼ AP 1.1.4.1/2 (3 children) {
  │   │   │     Total:     8,488,800 bytes (32.92%, 14,761.21/Minstr) in 17,685 blocks (11.57%, 30.75/Minstr), avg size 480 bytes, avg lifetime 443,226.75 instrs (0.08% of program duration)
  │   │   │     Reads:     1,822,343 bytes (6.92%, 3,168.88/Minstr), 0.21/byte
  │   │   │     Writes:    1,118,090 bytes (4.26%, 1,944.25/Minstr), 0.13/byte
  │   │   │   ├─▼ AP 1.1.4.1.1/3 (3 children) {
  │   │   │   │     Total:     4,350,240 bytes (16.87%, 7,564.65/Minstr) in 9,063 blocks (5.93%, 15.76/Minstr), avg size 480 bytes, avg lifetime 68,924.73 instrs (0.01% of program duration)
  │   │   │   │   ├── AP 1.1.4.1.1.1/3 {
  │   │   │   │   │     Total:     3,161,280 bytes (12.26%, 5,497.16/Minstr) in 6,586 blocks (4.31%, 11.45/Minstr), avg size 480 bytes, avg lifetime 80,833.43 instrs (0.01% of program duration)
  │   │   │   │   │     Max:       2,880 bytes in 6 blocks, avg size 480 bytes
  │   │   │   │   │       #11: 0x1942F8: graphql::service::SelectionSetParams::SelectionSetParams(...) (GraphQLService.h:144)
  │   │   │   │   │       #12: 0x194336: graphql::service::ResolverParams::ResolverParams(...) (GraphQLService.h:233)
  │   │   │   │   ├── AP 1.1.4.1.1.2/3 {
  │   │   │   │   │     Total:     1,044,960 bytes (4.05%, 1,817.09/Minstr) in 2,177 blocks (1.42%, 3.79/Minstr), avg size 480 bytes, avg lifetime 40,092.55 instrs (0.01% of program duration)
  │   │   │   │   │     Max:       480 bytes in 1 blocks, avg size 480 bytes
  │   │   │   │   │       #11: 0x29E558: graphql::service::SelectionVisitor::visitField(...) (GraphQLService.cpp:1044)
  │   │   │   │   │       #12: 0x29DB1E: graphql::service::SelectionVisitor::visit(...) (GraphQLService.cpp:940)
  │   │   │   ├─▼ AP 1.1.4.1.2/3 (3 children) {
  │   │   │   │     Total:     4,090,080 bytes (15.86%, 7,112.26/Minstr) in 8,521 blocks (5.57%, 14.82/Minstr), avg size 480 bytes, avg lifetime 804,305.27 instrs (0.14% of program duration)
  │   │   │   │     Reads:     1,822,343 bytes (6.92%, 3,168.88/Minstr), 0.45/byte
  │   │   │   │     Writes:    1,118,090 bytes (4.26%, 1,944.25/Minstr), 0.27/byte
  │   │   │   │   ├─▼ AP 1.1.4.1.2.1/3 (3 children) {
  │   │   │   │   │     Total:     2,245,920 bytes (8.71%, 3,905.44/Minstr) in 4,679 blocks (3.06%, 8.14/Minstr), avg size 480 bytes, avg lifetime 1,398,266.28 instrs (0.24% of program duration)
  │   │   │   │   │     Reads:     979,504 bytes (3.72%, 1,703.26/Minstr), 0.44/byte
  │   │   │   │   │     Writes:    635,763 bytes (2.42%, 1,105.53/Minstr), 0.28/byte
  │   │   │   │   │   ├── AP 1.1.4.1.2.1.1/3 {
  │   │   │   │   │   │     Total:     1,049,280 bytes (4.07%, 1,824.6/Minstr) in 2,186 blocks (1.43%, 3.8/Minstr), avg size 480 bytes, avg lifetime 1,126,835.31 instrs (0.2% of program duration)
  │   │   │   │   │   │     Max:       10,080 bytes in 21 blocks, avg size 480 bytes
  │   │   │   │   │   │       ^10: 0x1B69AA: graphql::service::SelectionSetParams::SelectionSetParams(...) (GraphQLService.h:144)
  │   │   │   │   │   │       #11: 0x29B1F9: graphql::service::ResolverParams::ResolverParams(...) (GraphQLService.cpp:504)
  │   │   │   │   │   │       #12: 0x29E62A: graphql::service::SelectionVisitor::visitField(...) (GraphQLService.cpp:1048)
  │   │   │   │   │   ├─▼ AP 1.1.4.1.2.1.2/3 (1 children) {
  │   │   │   │   │   │     Total:     1,048,800 bytes (4.07%, 1,823.76/Minstr) in 2,185 blocks (1.43%, 3.8/Minstr), avg size 480 bytes, avg lifetime 14,699.06 instrs (0% of program duration)
  │   │   │   │   │   │     Reads:     231,438 bytes (0.88%, 402.45/Minstr), 0.22/byte
  │   │   │   │   │   │     Writes:    297,532 bytes (1.13%, 517.38/Minstr), 0.28/byte
  │   │   │   │   │   │       ^10: 0x1B69AA: graphql::service::SelectionSetParams::SelectionSetParams(...) (GraphQLService.h:144)
  │   │   │   │   │   │       #11: 0x2991A5: graphql::service::FieldParams::FieldParams(...) (GraphQLService.cpp:164)
  │   │   │   │   ├─▼ AP 1.1.4.1.2.2/3 (1 children) {
  │   │   │   │   │     Total:     1,044,960 bytes (4.05%, 1,817.09/Minstr) in 2,177 blocks (1.42%, 3.79/Minstr), avg size 480 bytes, avg lifetime 46,384.37 instrs (0.01% of program duration)
  │   │   │   │   │     Reads:     485,500 bytes (1.84%, 844.24/Minstr), 0.46/byte
  │   │   │   │   │     Writes:    309,304 bytes (1.18%, 537.85/Minstr), 0.3/byte
  │   │   │   │   │       #10: 0x29E460: graphql::service::SelectionVisitor::visitField(graphql::peg::ast_node const&) (GraphQLService.cpp:1031)
  │   │   │   │   │       #11: 0x29DB1E: graphql::service::SelectionVisitor::visit(graphql::peg::ast_node const&) (GraphQLService.cpp:940)

AP significance threshold: total >= 1,528.84 blocks (1%)

@ghost

ghost commented Dec 2, 2020

CLA assistant check
All CLA requirements met.

@wravery
Contributor

wravery commented Dec 3, 2020

Great job on the performance analysis! I'll try to get this reviewed and merged soon.

@wravery wravery self-assigned this Dec 3, 2020
@barbieri
Contributor Author

barbieri commented Dec 3, 2020

Thanks @wravery, we're using this in an embedded system and it's very sensitive to memory pressure.

I have other patches in my https://github.com/profusion/cppgraphqlgen/tree/perf, but I'll wait this one to be reviewed and merged so I can rebase and propose the other PR.

Contributor

@wravery wravery left a comment

I think a lot of the benefits of this change could be achieved just by keeping a std::unique_ptr<ValidateExecutableVisitor> alive as a member of the Request. Would you try that change as a smaller comparison? I think it would be a lot easier to reason about than passing the response to the IntrospectionQuery into the Request. If that one change is enough to get similar results, I'd prefer to do that.

BTW, depending on your scenario and how much query caching you can do, you might be able to erase the impact of validation entirely after an initial parse. The sample test case would have 0 overhead for validation after the first iteration if the peg::ast variable was declared outside of the loop. After validation it sets ast.validated to true and skips validation after that.

std::shared_ptr<ValidateType> type;

ValidateArgument() = default;
ValidateArgument(std::shared_ptr<ValidateType>& type_)
Contributor

const std::shared_ptr<ValidateType>&?

Contributor

Alternatively, if you move the declaration of type to the top, the default constructor/initializer order should just do the right thing, with default values for the other 2 members.

Contributor

Actually, that applies to ValidateType as well. You could omit the constructor overrides and use initializer syntax and default constructors for pretty much every struct.

Contributor Author

ok, will change that one.

std::shared_ptr<ValidateType> returnType;
ValidateTypeFieldArguments arguments;

ValidateTypeField() = default;
Contributor

I think all of these constructors match the default compiler generated constructors. You could just omit them.


bool isInputType() const;
ValidateType getType();
std::shared_ptr<ValidateType>&& getType();
Contributor

Should return by value. From what I've heard return-value-optimization (RVO) works better that way.

To get move semantics you can move to a local variable (e.g. auto result = std::move(value);) and then return the local variable by value, and all the RVO goodness should apply.

Contributor Author

oh really? I changed some of those to get it shorter 👀

seriously, the more I look into C++, the more I dislike this language 😆

// This is taking advantage of the fact that during validation we can choose to execute
// unvalidated queries against the Introspection schema. This way we can use fragment
// cycles to expand an arbitrary number of wrapper types.
ast.validated = true;
Contributor

Using the canonical IntrospectionQuery does have this limitation, where it can only follow links to a fixed depth. It's compatible with more tools, but it does put a limit on the complexity of the schema.

Contributor Author

I get this, but while there may be a combination of lists, non-nulls and the actual type, it's pretty unusual in real life, even more so given that it would not work in any other GraphQL tool.

Do you want me to move this and use the non-standard recursive query here?

wravery added a commit to wravery/cppgraphqlgen that referenced this pull request Dec 6, 2020
@wravery
Contributor

wravery commented Dec 6, 2020

> I think a lot of the benefits of this change could be achieved just by keeping a std::unique_ptr alive in a member on the Request. Would you try that change as a smaller comparison? I think it would be a lot easier to reason about it that way than by passing the response to the IntrospectionQuery into the Request. If that one change is enough to get similar results I'd prefer to do that.

I went ahead and merged that change as part of #131, so if you want to repeat your measurements with that version we can see how effective it is compared to the full change in your PR. I can try to take my own before and after measurements, but I won't be able to compare them directly to your environment or results.

@barbieri
Contributor Author

barbieri commented Dec 7, 2020

@wravery I'll spare some time to do the measurements and resolve the conflicts. I also have response::ResultType in place on top of my branch, which I still need to measure and compare, but it seems to be nicer on memory.

However, just keeping the visitor as you did will not solve all the issues: it will not allow me to remove introspection from the production server, and there is no optimized lookup for types (you keep the not-so-cheap maps, lookups, conversions to kind and generation of type strings to make comparisons simpler).

> I think it would be a lot easier to reason about it that way than by passing the response to the IntrospectionQuery into the Request

Also, not sure you've noticed, but with my code you can still get the introspection if you do not provide the values; it's backward compatible. In the near future I plan to generate the introspection results in code, so we can avoid the introspection query altogether, only building the results using response::Value.

> If that one change is enough to get similar results I'd prefer to do that.

The problem is that we're still too slow, at least compared to Apollo: we're about 3x slower, which in turn is 2-3x slower than Go implementations. So we need some more work; my initial PR is not enough.

@barbieri
Contributor Author

barbieri commented Dec 7, 2020

Massif using fb4a589

    KB
401.4^                  #                                                     
     |                  #:::    :::                      @   :   :   :  :   : 
     |  @    ::   ::: ::#:: ::: :: :::@:::::::@::::::::::@::::::@:::::@:::::@:
     |  @:::::  :::: :::#:: :::::: :::@::: :::@:::::::: :@::::::@:::::@:::::@:
     |  @:: ::  :::: :::#:: :::::: :::@::: :::@:::::::: :@::::::@:::::@:::::@:
     |  @:: ::  :::: :::#:: :::::: :::@::: :::@:::::::: :@::::::@:::::@:::::@:
     |  @:: :: ::::: :::#:: :::::: :::@::: :::@:::::::: :@::::::@:::::@:::::@:
     |  @:: :: ::::: :::#:: :::::: :::@::: :::@:::::::: :@::::::@:::::@:::::@:
     |  @:: :: ::::: :::#:: :::::: :::@::: :::@:::::::: :@::::::@:::::@:::::@:
     |  @:: :: ::::: :::#:: :::::: :::@::: :::@:::::::: :@::::::@:::::@:::::@:
     |::@:: :: ::::: :::#:: :::::: :::@::: :::@:::::::: :@::::::@:::::@:::::@:
     |: @:: :: ::::: :::#:: :::::: :::@::: :::@:::::::: :@::::::@:::::@:::::@:
     |: @:: :: ::::: :::#:: :::::: :::@::: :::@:::::::: :@::::::@:::::@:::::@:
     |: @:: :: ::::: :::#:: :::::: :::@::: :::@:::::::: :@::::::@:::::@:::::@:
     |: @:: :: ::::: :::#:: :::::: :::@::: :::@:::::::: :@::::::@:::::@:::::@:
     |: @:: :: ::::: :::#:: :::::: :::@::: :::@:::::::: :@::::::@:::::@:::::@:
     |: @:: :: ::::: :::#:: :::::: :::@::: :::@:::::::: :@::::::@:::::@:::::@:
     |: @:: :: ::::: :::#:: :::::: :::@::: :::@:::::::: :@::::::@:::::@:::::@:
     |: @:: :: ::::: :::#:: :::::: :::@::: :::@:::::::: :@::::::@:::::@:::::@:
     |: @:: :: ::::: :::#:: :::::: :::@::: :::@:::::::: :@::::::@:::::@:::::@:
   0 +----------------------------------------------------------------------->Mi
     0                                                                   463.7

Number of snapshots: 82
 Detailed snapshots: [3, 18 (peak), 32, 39, 49, 59, 69, 79]

--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
  0              0                0                0             0            0
  1      5,002,462          219,232          209,971         9,261            0
  2     14,225,829          343,512          324,739        18,773            0
  3     18,831,299          380,896          358,031        22,865            0
...
 18    122,066,937          411,064          380,458        30,606            0
 19    126,340,408          381,168          353,213        27,955            0
 20    134,532,557          393,712          364,716        28,996            0
 21    141,563,410          392,360          363,638        28,722            0
...
 75    460,585,577          378,736          350,903        27,833            0
 76    464,859,164          374,920          347,463        27,457            0
 77    469,134,602          379,752          351,933        27,819            0
 78    473,414,161          391,240          362,541        28,699            0
 79    477,688,737          386,696          358,116        28,580            0
 80    481,964,540          374,200          346,823        27,377            0
 81    486,240,263          378,200          350,415        27,785            0

So it's running around 374 KB, while my branch runs at 322 KB.

dhat:

Times {
  t-gmax: 125,212,274 instrs (25.56% of program duration)
  t-end:  489,932,197 instrs
}

This improved a bit compared to mine (224, 575), likely because of the lazy __Type query, but in my real usage (and likely others in production) it's better to pay that price upfront instead of impacting queries the first time they use a new type.

I'll rebase my work on top and see how both play along together (but still avoiding individual __Type queries)

Doing the introspection query all the time is hurting performance. Since
the schema does not change, a single query can be done with all the
fields to build a validation tree, which is then queried on all validations.

This is the first part, moving the top-level query to be read-only.

The next commits will eliminate the other `executeQuery()` calls, then a shared
context will be created and hosted by the `Request` class, which can
be discovered using introspection or fed using JSON (schema.json).
This is an incremental commit: just make use of the read-only data
instead of `release` primitives, which allows sharing the query results.
Split the fields getter and cache/insertion into the map so they can
be used later in a different way.

There should be no code differences, just moving around the internal
branch to its own function.

In the next commits, this will be removed from the getter, as it will
be query-only as the types will be all cached beforehand.
@wravery
Contributor

wravery commented Dec 7, 2020

However, in the near future I plan to generate the introspection results in code; then we can avoid the introspection query, only building the results using response::Value

Wouldn't you need to edit the generated files to do this? Alternatively, maybe there should be a switch for schemagen which suppresses the runtime Introspection fields.

If we modify the schemagen tool, we could also teach it to generate a static data structure with no parsing/serialization to be used with validation. It would take the responsibility of pre-caching any IntrospectionQuery off of the consumer and they wouldn't have to pass the extra response::Value. It should also be much faster since it wouldn't be operating on the response::Value type at all (part of what your PR has been doing, but without even the initial intake of the response::Value).
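
As a rough illustration of what such generated static data might look like (all enum, type, and field names below are hypothetical, not actual schemagen output): the binary carries read-only tables, so validation just indexes into them with no parsing or response::Value intake at startup.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Illustrative kinds; a real generator would cover all GraphQL type kinds.
enum class TypeKind { Scalar, Object, NonNull, List };

struct StaticField {
	std::string_view name;
	std::size_t typeIndex; // index into the types table below
};

struct StaticType {
	TypeKind kind;
	std::string_view name;
	const StaticField* fields;
	std::size_t fieldCount;
};

// A tiny hypothetical schema: type Query { hello: String }
constexpr StaticField queryFields[] = { { "hello", 1 } };
constexpr StaticType types[] = {
	{ TypeKind::Object, "Query", queryFields, 1 },
	{ TypeKind::Scalar, "String", nullptr, 0 },
};
```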

The problem is that we're still too slow, at least compared to Apollo: we're about 3x slower, which in turn is 2-3x slower than Go implementations. So we need some more work; my initial PR is not enough.

Interesting, I never tried a direct comparison with either of those. Partly that's because I've been thinking of it as filling a different niche, specifically interop with existing C++ code in a hybrid web or React Native client. Upon handing off to the JS UI code, most of the native perf concerns become less relevant, they're generally orders of magnitude faster/cheaper just being native. I do mostly desktop development, so even Electron is generally fast enough.

So in your scenario, are you running just a GraphQL service on the device and handling the results elsewhere? Can you share a sample for either of those alternatives so I can see what we're up against?

This handles OBJECT, INTERFACE, UNION and INPUT_OBJECT types.

It should have no behavior change, just moving code around. Minor
adjustments were made to cope with the returned iterators.
It should have no behavior change, just moving code around.
This uses the information being queried in the introspection and
allows the fields and input fields to be processed in one go.
This is another step to split the visitor from the lookup data
structures, in the future the lookup will be shared.
ValidateExecutableVisitor was split into a lookup data structure
(ValidationContext) and the actual visitor.

The lookup data structure is shared across requests, saving queries
and processing.
We do not need a map; there are only 3 well-defined names
@barbieri
Contributor Author

barbieri commented Dec 7, 2020

However, in the near future I plan to generate the introspection results in code; then we can avoid the introspection query, only building the results using response::Value

Wouldn't you need to edit the generated files to do this? Alternatively, maybe there should be a switch for schemagen which suppresses the runtime Introspection fields.

I was thinking of a #if SCHEMAGEN_DISABLE_INTROSPECTION == 1, so this kind of flag can be set from the user's build system (autoconf/cmake).

If we modify the schemagen tool, we could also teach it to generate a static data structure with no parsing/serialization to be used with validation. It would take the responsibility of pre-caching any IntrospectionQuery off of the consumer and they wouldn't have to pass the extra response::Value. It should also be much faster since it wouldn't be operating on the response::Value type at all (part of what your PR has been doing, but without even the initial intake of the response::Value).

Yeah, this is my ultimate goal. I'm close to that in my PR (still cleaning up), first I'm working on the data structures (which I should push tomorrow or so), then I'll generate this validationContext directly in code.

The problem is that we're still too slow, at least compared to Apollo: we're about 3x slower, which in turn is 2-3x slower than Go implementations. So we need some more work; my initial PR is not enough.

Interesting, I never tried a direct comparison with either of those. Partly that's because I've been thinking of it as filling a different niche, specifically interop with existing C++ code in a hybrid web or React Native client. Upon handing off to the JS UI code, most of the native perf concerns become less relevant, they're generally orders of magnitude faster/cheaper just being native. I do mostly desktop development, so even Electron is generally fast enough.

Usually, yes. But in my case there is no rendering being done; the data is just normalized for another device (web, Android, ...) where the GraphQL data is displayed.

So in your scenario, are you running just a GraphQL service on the device and handling the results elsewhere? Can you share a sample for either of those alternatives so I can see what we're up against?

Yes, this is an embedded device that normalizes various sources as GraphQL queries. I cannot disclose much at this point (working for a customer under NDA), but we generate the resolvers to access some sources in C++. As the number of sources and properties is large and the hardware is underpowered, we did run into performance issues; that's why I'm trying to fix them.

@wravery
Contributor

wravery commented Dec 7, 2020

I cannot disclose much at this point (working for a customer under NDA)

👍 No problem, I was just hoping you already had benchmarks you could share for Apollo or Go.

@barbieri
Contributor Author

barbieri commented Dec 8, 2020

@wravery pushed what I have done so far and changed what you pointed out in the first review (commits were edited, so take a look at them again; there are no fixup commits)

I've reworked the ValidateType to be an abstract base class and added specialized classes ScalarType, EnumType, InputObjectType, ObjectType, InterfaceType and UnionType. This saves some memory (no need to store kind) as well as cleaning up the code, since each specialization can implement its own behavior (getInnerType(), etc.).
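
The described hierarchy might look roughly like this (heavily simplified, showing only two of the specializations; member and method names are guesses, not the actual PR code). The kind is encoded by the dynamic type itself, so nothing needs to be stored or compared as a string.

```cpp
#include <cassert>
#include <string_view>

// Abstract base; each specialization supplies its own behavior.
struct ValidateType {
	virtual ~ValidateType() = default;
	virtual std::string_view name() const = 0;
	virtual bool isInputType() const = 0;
	// Wrapper kinds (NON_NULL/LIST) would override this to expose
	// the wrapped type; plain types return themselves.
	virtual const ValidateType* getInnerType() const { return this; }
};

struct ScalarType final : ValidateType {
	explicit ScalarType(std::string_view n) : _name(n) {}
	std::string_view name() const final { return _name; }
	bool isInputType() const final { return true; } // scalars are valid inputs
	std::string_view _name;
};

struct ObjectType final : ValidateType {
	explicit ObjectType(std::string_view n) : _name(n) {}
	std::string_view name() const final { return _name; }
	bool isInputType() const final { return false; } // output-only
	std::string_view _name;
};
```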

Adding fields to input objects/objects was split into a second iteration; this way we know for sure the named types exist and can use those references.

Things like _scopedType are now all references to the actual types, which reduces the number of lookups.

Also did some work to reduce the memory allocations, moving some of the std::string to std::string_view. The remaining std::string instances exist for things declared via introspection (enums, fields...). Once I work on the code generator, the introspection and these std::string should be gone.

This PR now contains the response::Type::Result, to keep the data and errors in a more efficient way.

@barbieri
Contributor Author

barbieri commented Dec 8, 2020

weird, on MacOS/clang it's not giving that error. I'll test on Linux

Instead of using a map with properties `name`, `kind` (string) and
`ofType` (another map), use a set of custom classes with kind as
enumeration and ofType as shared pointer.

This allows much simpler handling and comparison; there is no need to
serialize to a string just to make comparisons easier.

We can also store references to types, know which kind each is (e.g.:
isInputType?) and save memory by using references, in particular to
common types such as Int, Float, String...

The matchingTypes and fields are stored as part of each ValidateType
specialization.
Instead of 2 maps + set (both ordered), use one single unordered_map
(string_view) + unordered_set (pointer to definition).

The string_view is okay since the ast_node tree is valid during the
processing, so the references are valid.

The pointer to definition is also okay for _referencedVariables, since
the definitions are all created upfront and the map (thus references)
won't change while visiting the fields.

Going from 2 maps to 1 was possible since we now store the definition
location directly, instead of using a second map just to keep the
ast_node for querying the position on errors.
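
A sketch of that single-map layout under the stated lifetime assumption (the struct and member names are illustrative): the key is a string_view into the AST, which stays alive for the whole visit, and the value carries the definition location, so no second ast_node map is needed for error positions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <unordered_map>

struct Position {
	std::size_t line = 0;
	std::size_t column = 0;
};

struct VariableInfo {
	Position definedAt;      // replaces the separate ast_node lookup map
	bool referenced = false; // replaces the separate referenced set
};

// The string_view keys point into AST storage that outlives the map.
using VariableMap = std::unordered_map<std::string_view, VariableInfo>;
```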
Pre-allocate a vector and populate it, then iterate directly instead
of using a queue
Instead of always creating a recursive resolver, which in turn may
call `std::async()`, only do that if the result is not readily
available.
The wrap is not free and is, more often than not, useless.
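
A sketch of that optimization, assuming a hypothetical `convertDeferred()` helper (not the PR's actual function): the wrap only falls back to a new task when the value is not already available.

```cpp
#include <cassert>
#include <chrono>
#include <future>

// Only spawn a deferred task when the input future is not already
// resolved; otherwise hand the value straight back.
template <typename T>
std::future<T> convertDeferred(std::future<T>&& value)
{
	using namespace std::chrono_literals;

	if (value.wait_for(0s) == std::future_status::ready) {
		// Already resolved: no new task, no extra wrap.
		std::promise<T> promise;
		promise.set_value(value.get());
		return promise.get_future();
	}

	// Not ready yet: defer the (possibly expensive) conversion.
	return std::async(std::launch::deferred,
		[](std::future<T>&& pending) { return pending.get(); },
		std::move(value));
}
```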
Just minor tweaks to make it compile, moving the template functions to
the header and also marking virtuals as final.

In the next commits it will be moved to more public usage, including
the generator.
Soon there will be a generated ValidationContext, thus we don't need
to carry any of the introspection bits.
This is basically moving code around, changing the parameters to allow
the request to receive the ValidationContext.

Bring back the original Request() constructor so it will not break
dynamically linked binaries.

The introspection results are kept around; in the future the
validation context will only use pointers into strings (string_view),
so everything must stay alive in order to work.
@barbieri
Contributor Author

@wravery finally got everything to work 😅 It became HUGE; the commits are not in the best possible order, as I fixed some issues while reading/measuring the code paths.

The code generator now outputs a ValidationContext that is given to Request(); in that case the introspection is not used. The legacy constructor still uses the introspection, so existing binaries should work.

You can skim through the commit messages to see all that was changed, but a summary is:

  • ValidationContext should be subclassed; one option is IntrospectionValidationContext (the default), but schemagen generates one as well.
  • response::Value::ResultType to store data/error pairs
  • response::Value now uses an internal unique_ptr<TypeData>, an abstract class. Specializations for each type make things a bit cleaner and smaller: there is no need to store the type, and Enum/String/JSONString are different classes
  • response::Value with complex data uses shared_ptr<> and Copy-on-Write semantics
  • FieldResult conversions are always deferred, regardless of the params.launch mode. The basic conversions were also optimized, avoiding std::async() whenever possible.
  • do not convert to {data,errors} (response::Value::ResultType) unless needed, avoiding the extra wrap.
  • std::queue (deque) replaced with std::list to process selections; it uses less memory and also allows joining lists without loop-move-pop;
  • std::string_view replaces std::string whenever possible; validation now runs fully on string_views (but the JSON introspection will keep the introspection results alive to allow that -- we could keep just the strings and throw away the rest, but that's not done atm).
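
The copy-on-write point above can be sketched with a toy container (not the actual response::Value implementation): copies share one buffer until a writer detaches, so duplicating large results stays cheap.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Toy copy-on-write list: shared_ptr-backed storage, copied only when
// a holder that shares the buffer actually mutates it.
class CowList {
public:
	void push_back(int v)
	{
		detach();
		_data->push_back(v);
	}

	std::size_t size() const { return _data->size(); }
	bool sharesStorageWith(const CowList& o) const { return _data == o._data; }

private:
	void detach()
	{
		if (_data.use_count() > 1) // someone else still sees this buffer
			_data = std::make_shared<std::vector<int>>(*_data); // copy on write
	}

	std::shared_ptr<std::vector<int>> _data = std::make_shared<std::vector<int>>();
};
```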

The DHAT numbers are impressive: 125,212,274 -> 9,472,310 (t-gmax), with half of the memory now used by graphql::peg stuff.

Times {
  t-gmax: 9,472,310 instrs (2.79% of program duration)
  t-end:  339,659,840 instrs
}

This massif chart should give you a clear picture of the final results: there is no peak anymore and memory is stable at ~300 KB.

    KB
302.0^                 ::                                                     
     | ##:: ::@: ::::: : ::  ::  :::@: ::::::::::  ::::::::      ::::::::@::::
     | # : :: @::: ::::: ::::: :::::@::: ::::: ::::::: : ::::::::: ::::::@::::
     | # : :: @::: ::::: ::: : :::::@::: ::::: ::: ::: : ::: :: :: ::::::@::::
     | # : :: @::: ::::: ::: : :::::@::: ::::: ::: ::: : ::: :: :: ::::::@::::
     | # : :: @::: ::::: ::: : :::::@::: ::::: ::: ::: : ::: :: :: ::::::@::::
     | # : :: @::: ::::: ::: : :::::@::: ::::: ::: ::: : ::: :: :: ::::::@::::
     | # : :: @::: ::::: ::: : :::::@::: ::::: ::: ::: : ::: :: :: ::::::@::::
     | # : :: @::: ::::: ::: : :::::@::: ::::: ::: ::: : ::: :: :: ::::::@::::
     | # : :: @::: ::::: ::: : :::::@::: ::::: ::: ::: : ::: :: :: ::::::@::::
     | # : :: @::: ::::: ::: : :::::@::: ::::: ::: ::: : ::: :: :: ::::::@::::
     | # : :: @::: ::::: ::: : :::::@::: ::::: ::: ::: : ::: :: :: ::::::@::::
     | # : :: @::: ::::: ::: : :::::@::: ::::: ::: ::: : ::: :: :: ::::::@::::
     | # : :: @::: ::::: ::: : :::::@::: ::::: ::: ::: : ::: :: :: ::::::@::::
     | # : :: @::: ::::: ::: : :::::@::: ::::: ::: ::: : ::: :: :: ::::::@::::
     | # : :: @::: ::::: ::: : :::::@::: ::::: ::: ::: : ::: :: :: ::::::@::::
     | # : :: @::: ::::: ::: : :::::@::: ::::: ::: ::: : ::: :: :: ::::::@::::
     | # : :: @::: ::::: ::: : :::::@::: ::::: ::: ::: : ::: :: :: ::::::@::::
     | # : :: @::: ::::: ::: : :::::@::: ::::: ::: ::: : ::: :: :: ::::::@::::
     | # : :: @::: ::::: ::: : :::::@::: ::::: ::: ::: : ::: :: :: ::::::@::::
   0 +----------------------------------------------------------------------->Mi
     0                                                                   323.0

Number of snapshots: 65
 Detailed snapshots: [2 (peak), 7, 26, 58]

@barbieri
Contributor Author

Damn, the test failed on CI, related to __typename, @skip, @include and the other built-in stuff. It seems schemagen is not generating those.

This avoids the introspection query and simplifies the build of lookup
maps
The generated file contains "#ifdef SCHEMAGEN_DISABLE_INTROSPECTION";
if that is set, the introspection blocks will be disabled:

 - no __schema and __type resolvers
 - no AddTypesToSchema
 - no _schema field
@barbieri
Contributor Author

@wravery now it includes all the introspection stuff (I did a complete generation: even if the current schema doesn't contain any enums or input types, if we add those in the future the generator will just work).

It also includes #ifndef SCHEMAGEN_DISABLE_INTROSPECTION and can generate binaries without any introspection fields/resolvers. Added sample_nointrospection and nointrospection_tests to make sure those work.

@barbieri
Contributor Author

Running the sample without introspection, the results are:

DHAT: 125,212,274 -> 6,522,087 (t-gmax); 49% of the allocated memory is in graphqlpeg, followed by field_path at 25%.

Times {
  t-gmax: 6,522,087 instrs (1.94% of program duration)
  t-end:  336,125,099 instrs
}

Massif reports less than half the memory used:

    KB
128.2^     :                                                                  
     | #   :    ::         :     ::           :               :@         ::   
     | # :@:::: : : :::::  : :: ::     : ::::::::       @::  ::@:  :::   :::  
     | #::@:: ::: ::::: :::::::::: ::@::::::: :: :::::::@::::::@:::::@:: :::@:
     | #::@:: ::: ::::: :::::::::: ::@ :::::: :: :::::::@::::::@:::::@::::::@:
     | #::@:: ::: ::::: :::::::::: ::@ :::::: :: :::::::@::::::@:::::@::::::@:
     | #::@:: ::: ::::: :::::::::: ::@ :::::: :: :::::::@::::::@:::::@::::::@:
     | #::@:: ::: ::::: :::::::::: ::@ :::::: :: :::::::@::::::@:::::@::::::@:
     | #::@:: ::: ::::: :::::::::: ::@ :::::: :: :::::::@::::::@:::::@::::::@:
     | #::@:: ::: ::::: :::::::::: ::@ :::::: :: :::::::@::::::@:::::@::::::@:
     | #::@:: ::: ::::: :::::::::: ::@ :::::: :: :::::::@::::::@:::::@::::::@:
     | #::@:: ::: ::::: :::::::::: ::@ :::::: :: :::::::@::::::@:::::@::::::@:
     | #::@:: ::: ::::: :::::::::: ::@ :::::: :: :::::::@::::::@:::::@::::::@:
     | #::@:: ::: ::::: :::::::::: ::@ :::::: :: :::::::@::::::@:::::@::::::@:
     | #::@:: ::: ::::: :::::::::: ::@ :::::: :: :::::::@::::::@:::::@::::::@:
     | #::@:: ::: ::::: :::::::::: ::@ :::::: :: :::::::@::::::@:::::@::::::@:
     | #::@:: ::: ::::: :::::::::: ::@ :::::: :: :::::::@::::::@:::::@::::::@:
     | #::@:: ::: ::::: :::::::::: ::@ :::::: :: :::::::@::::::@:::::@::::::@:
     | #::@:: ::: ::::: :::::::::: ::@ :::::: :: :::::::@::::::@:::::@::::::@:
     | #::@:: ::: ::::: :::::::::: ::@ :::::: :: :::::::@::::::@:::::@::::::@:
   0 +----------------------------------------------------------------------->Mi
     0                                                                   318.9

Provide `push()` and `pop()` convenience methods so the interface
matches `queue`.

`list.size()` is not as fast, but these lists are usually small enough
for it not to matter (walking the list counting the elements)
Change SelectionSetParams to keep an optional reference to the parent;
this way we don't need to build the path over and over again just to
add one element, and instead create it on demand.

Since errorPath was accessed directly, this breaks the existing code:
it became a method that dynamically computes the error path
(recursively).

This is important since we only pay the list copy price when there is
an error, not on every field resolution.
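
A sketch of that parent-chain idea (the names are illustrative; the real SelectionSetParams differs): each field's params only hold a pointer to the parent segment, and the full error path is materialized by walking that chain only when an error is actually reported.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// One segment per resolved field; the chain lives on the resolver
// call stack, so no path copy happens on the happy path.
struct PathSegment {
	const PathSegment* parent = nullptr; // not owned
	std::string name;
};

// Only called when an error needs reporting: walk up, then reverse
// so the path reads root-first.
std::vector<std::string> buildErrorPath(const PathSegment* leaf)
{
	std::vector<std::string> path;
	for (const PathSegment* s = leaf; s != nullptr; s = s->parent)
		path.push_back(s->name);
	std::reverse(path.begin(), path.end());
	return path;
}
```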
Use this specific constructor in the list converter, creating one
itemParams with the new ownErrorPath, instead of changing the wrapper's
request param.
We shouldn't modify the parameters: using a string causes the field
name to be copied, which is particularly bad when processing huge
lists (it would copy the name for each item).
We don't change it anymore and we don't push to the array, so we can
keep it inline in the parent structure, avoiding the extra allocation.
@barbieri
Contributor Author

@wravery with this last commit, peg is 68% of the allocations; everything else runs much more smoothly.

I'm running out of time to work on more optimizations, but if you know how to get peg to play nicer with memory, let me know

@wravery
Contributor

wravery commented Dec 10, 2020

I'm running out of time to work on more optimizations, but if you know how to get peg to play nicer with memory, let me know

Sounds good. I'm going to make a cleanup pass to make sure it's consistent with the rest of the project, and then I should be able to get it merged sometime this week.

Thanks for this contribution!

@barbieri
Contributor Author

I cannot disclose much at this point (working for a customer under NDA)

👍 No problem, I was just hoping you already had benchmarks you could share for Apollo or Go.

I forgot to reply to this one. I can't share many details due to the NDA, but here are the raw numbers (C++ runs with std::launch::async; we'll change that later):

| Test | Framework | Results |
| --- | --- | --- |
| Flat Schema | Apollo/JS | 11s |
| Flat Schema | C++ cached-validation | 20s |
| Flat Schema | C++ Pre Optimizations | 305s |
| Nested Schema | Apollo/JS | 11s |
| Nested Schema | C++ cached-validation | 22s |
| Nested Schema | C++ Pre Optimizations | 110s |

"Pre Optimizations" is 3add6d3, BEFORE your cached validation visitor. By far that was the biggest source of slowness. Just that helped a lot, however other changes like changing the converters to be more efficient, remove some useless std::launch::async and so on, also helped ... bit by bit.

This test uses artificially generated schemas: one is deeply nested, the other is a huge flat schema. We're querying 50 leaf fields, 500 queries over the network (HTTP/GET), using websocketpp in the C++ version.

We don't have it written in Go to say for sure, but given this https://github.com/appleboy/golang-graphql-benchmark and https://github.com/the-benchmarker/graphql-benchmarks/blob/develop/rates.md we can estimate how slow JS is compared to Go.

@wravery
Contributor

wravery commented Dec 11, 2020

While reviewing this, I thought of another approach that I'd like to take. Rather than building a separate ValidationContext, I split the graphql::introspection types in Introspection.h into a compact, read-only set of structs with polymorphic implementations of BaseType (in the graphql::schema namespace inside a new file called GraphQLSchema.h), and I made all of the graphql::introspection types take those schema objects as a std::shared_ptr and just provide the service::Object accessor bindings that call through to the schema objects.

TL;DR: I'm not driving the validation through a separate hierarchy of cheaper validation objects; I'm driving both validation and introspection through a separate hierarchy of cheaper introspection objects.

Also, instead of using pre-processor directives, I added a --no-introspection switch to schemagen which will just generate the code without any of the declarations for the __schema and __type fields. In this mode, all it needs is the graphql::schema namespace objects; I still need to do some work to avoid linking the unused graphql::introspection namespace objects when they won't be used.

I have a little more cleanup on this approach to go, and I want to try rebasing or merging some of your other fixes on top of that, but the memory savings seem very promising. I also noticed my unit tests run about 2-3x faster 🎉! Here's what I got from debug builds running on WSL 2 (using a benchmark program which I also added based on your PR):

massif against master

--------------------------------------------------------------------------------
Command:            ./build/samples/benchmark
Massif arguments:   (none)
ms_print arguments: massif.out.5376
--------------------------------------------------------------------------------


    KB
401.4^                  #
     |              :   # : :  :                      @:      ::      ::
     |       @:   ::: ::#@::::::::::::::@:::::@@::::::@:::::@:::::@:::::@::::@
     |   @:::@: ::::::: #@:::::: :::::::@ ::: @ ::::::@:::::@:::::@:::::@::::@
     | ::@:::@: ::::::: #@:::::: :::::::@ ::: @ ::::::@:::::@:::::@:::::@::::@
     | : @:::@::::::::: #@:::::: :::::::@ ::: @ ::::::@:::::@:::::@:::::@::::@
     | : @:::@::::::::: #@:::::: :::::::@ ::: @ ::::::@:::::@:::::@:::::@::::@
     |:: @:::@::::::::: #@:::::: :::::::@ ::: @ ::::::@:::::@:::::@:::::@::::@
     |:: @:::@::::::::: #@:::::: :::::::@ ::: @ ::::::@:::::@:::::@:::::@::::@
     |:: @:::@::::::::: #@:::::: :::::::@ ::: @ ::::::@:::::@:::::@:::::@::::@
     |:: @:::@::::::::: #@:::::: :::::::@ ::: @ ::::::@:::::@:::::@:::::@::::@
     |:: @:::@::::::::: #@:::::: :::::::@ ::: @ ::::::@:::::@:::::@:::::@::::@
     |:: @:::@::::::::: #@:::::: :::::::@ ::: @ ::::::@:::::@:::::@:::::@::::@
     |:: @:::@::::::::: #@:::::: :::::::@ ::: @ ::::::@:::::@:::::@:::::@::::@
     |:: @:::@::::::::: #@:::::: :::::::@ ::: @ ::::::@:::::@:::::@:::::@::::@
     |:: @:::@::::::::: #@:::::: :::::::@ ::: @ ::::::@:::::@:::::@:::::@::::@
     |:: @:::@::::::::: #@:::::: :::::::@ ::: @ ::::::@:::::@:::::@:::::@::::@
     |:: @:::@::::::::: #@:::::: :::::::@ ::: @ ::::::@:::::@:::::@:::::@::::@
     |:: @:::@::::::::: #@:::::: :::::::@ ::: @ ::::::@:::::@:::::@:::::@::::@
     |:: @:::@::::::::: #@:::::: :::::::@ ::: @ ::::::@:::::@:::::@:::::@::::@
   0 +----------------------------------------------------------------------->Mi
     0                                                                   477.7

massif against this PR

I re-ran this after confirming my copy of this branch was up-to-date, and I'm still getting very different results from your last update, more in line with the previous update:

--------------------------------------------------------------------------------
Command:            ./build/samples/benchmark
Massif arguments:   (none)
ms_print arguments: massif.out.9027
--------------------------------------------------------------------------------


    KB
309.4^                                                    :
     |  #:::::::::::::@:::::::::::@:@:::::::@::::::::::::@::::::@::::::@::::::
     |  #::: ::: :: ::@::::::: :::@:@:::::::@: ::::::::::@::::::@::::::@::::::
     |  #::: ::: :: ::@::::::: :::@:@:::::::@: ::::::::::@::::::@::::::@::::::
     |  #::: ::: :: ::@::::::: :::@:@:::::::@: ::::::::::@::::::@::::::@::::::
     |  #::: ::: :: ::@::::::: :::@:@:::::::@: ::::::::::@::::::@::::::@::::::
     |  #::: ::: :: ::@::::::: :::@:@:::::::@: ::::::::::@::::::@::::::@::::::
     | :#::: ::: :: ::@::::::: :::@:@:::::::@: ::::::::::@::::::@::::::@::::::
     | :#::: ::: :: ::@::::::: :::@:@:::::::@: ::::::::::@::::::@::::::@::::::
     | :#::: ::: :: ::@::::::: :::@:@:::::::@: ::::::::::@::::::@::::::@::::::
     | :#::: ::: :: ::@::::::: :::@:@:::::::@: ::::::::::@::::::@::::::@::::::
     | :#::: ::: :: ::@::::::: :::@:@:::::::@: ::::::::::@::::::@::::::@::::::
     | :#::: ::: :: ::@::::::: :::@:@:::::::@: ::::::::::@::::::@::::::@::::::
     | :#::: ::: :: ::@::::::: :::@:@:::::::@: ::::::::::@::::::@::::::@::::::
     | :#::: ::: :: ::@::::::: :::@:@:::::::@: ::::::::::@::::::@::::::@::::::
     | :#::: ::: :: ::@::::::: :::@:@:::::::@: ::::::::::@::::::@::::::@::::::
     | :#::: ::: :: ::@::::::: :::@:@:::::::@: ::::::::::@::::::@::::::@::::::
     | :#::: ::: :: ::@::::::: :::@:@:::::::@: ::::::::::@::::::@::::::@::::::
     | :#::: ::: :: ::@::::::: :::@:@:::::::@: ::::::::::@::::::@::::::@::::::
     | :#::: ::: :: ::@::::::: :::@:@:::::::@: ::::::::::@::::::@::::::@::::::
   0 +----------------------------------------------------------------------->Mi
     0                                                                   280.7

massif against my version

--------------------------------------------------------------------------------
Command:            ./samples/benchmark
Massif arguments:   (none)
ms_print arguments: massif.out.12238
--------------------------------------------------------------------------------


    KB
166.5^ ##
     | #      :        :: : :             ::       :      : :              ::
     | # ::   :     : :::::::     :::::  :: :::::: ::   :::::    :   : :  ::::
     | # @ @::::::::::::::::::::::: :: :::: :: : :::::::: ::::  ::@::::::@::::
     | # @ @:::::: ::::::::::: :: : :: :::: :: : :::::::: ::::::::@::::::@::::
     | # @ @:::::: ::::::::::: :: : :: :::: :: : :::::::: ::::: ::@::::::@::::
     | # @ @:::::: ::::::::::: :: : :: :::: :: : :::::::: ::::: ::@::::::@::::
     | # @ @:::::: ::::::::::: :: : :: :::: :: : :::::::: ::::: ::@::::::@::::
     | # @ @:::::: ::::::::::: :: : :: :::: :: : :::::::: ::::: ::@::::::@::::
     | # @ @:::::: ::::::::::: :: : :: :::: :: : :::::::: ::::: ::@::::::@::::
     | # @ @:::::: ::::::::::: :: : :: :::: :: : :::::::: ::::: ::@::::::@::::
     | # @ @:::::: ::::::::::: :: : :: :::: :: : :::::::: ::::: ::@::::::@::::
     | # @ @:::::: ::::::::::: :: : :: :::: :: : :::::::: ::::: ::@::::::@::::
     | # @ @:::::: ::::::::::: :: : :: :::: :: : :::::::: ::::: ::@::::::@::::
     | # @ @:::::: ::::::::::: :: : :: :::: :: : :::::::: ::::: ::@::::::@::::
     | # @ @:::::: ::::::::::: :: : :: :::: :: : :::::::: ::::: ::@::::::@::::
     | # @ @:::::: ::::::::::: :: : :: :::: :: : :::::::: ::::: ::@::::::@::::
     | # @ @:::::: ::::::::::: :: : :: :::: :: : :::::::: ::::: ::@::::::@::::
     | # @ @:::::: ::::::::::: :: : :: :::: :: : :::::::: ::::: ::@::::::@::::
     | # @ @:::::: ::::::::::: :: : :: :::: :: : :::::::: ::::: ::@::::::@::::
   0 +----------------------------------------------------------------------->Mi
     0                                                                   349.4

massif against my version with --no-introspection

This generates the compact schema representation, but it blocks loading the introspection::* objects on top of that.

--------------------------------------------------------------------------------
Command:            ./samples/benchmark_nointrospection
Massif arguments:   (none)
ms_print arguments: massif.out.12523
--------------------------------------------------------------------------------


    KB
161.0^ #
     | #     :::::: ::                     :::     ::  ::    :        :   :
     | #   :::: ::: :              : :   : : ::::  :   :: :  :  ::   ::  :: ::
     | #:::: :: ::::: ::::::::::::::@::::::: ::::::: :::::: :::::::::::::::@::
     | #:: : :: ::::: : ::: :: ::: :@:: :::: ::::::: :::::::::: :::::::::::@::
     | #:: : :: ::::: : ::: :: ::: :@:: :::: ::::::: :::::::::: :::::::::::@::
     | #:: : :: ::::: : ::: :: ::: :@:: :::: ::::::: :::::::::: :::::::::::@::
     | #:: : :: ::::: : ::: :: ::: :@:: :::: ::::::: :::::::::: :::::::::::@::
     | #:: : :: ::::: : ::: :: ::: :@:: :::: ::::::: :::::::::: :::::::::::@::
     | #:: : :: ::::: : ::: :: ::: :@:: :::: ::::::: :::::::::: :::::::::::@::
     | #:: : :: ::::: : ::: :: ::: :@:: :::: ::::::: :::::::::: :::::::::::@::
     | #:: : :: ::::: : ::: :: ::: :@:: :::: ::::::: :::::::::: :::::::::::@::
     | #:: : :: ::::: : ::: :: ::: :@:: :::: ::::::: :::::::::: :::::::::::@::
     | #:: : :: ::::: : ::: :: ::: :@:: :::: ::::::: :::::::::: :::::::::::@::
     | #:: : :: ::::: : ::: :: ::: :@:: :::: ::::::: :::::::::: :::::::::::@::
     | #:: : :: ::::: : ::: :: ::: :@:: :::: ::::::: :::::::::: :::::::::::@::
     | #:: : :: ::::: : ::: :: ::: :@:: :::: ::::::: :::::::::: :::::::::::@::
     | #:: : :: ::::: : ::: :: ::: :@:: :::: ::::::: :::::::::: :::::::::::@::
     | #:: : :: ::::: : ::: :: ::: :@:: :::: ::::::: :::::::::: :::::::::::@::
     | #:: : :: ::::: : ::: :: ::: :@:: :::: ::::::: :::::::::: :::::::::::@::
   0 +----------------------------------------------------------------------->Mi
     0                                                                   327.3

The branch where I'm working on this is https://github.com/wravery/cppgraphqlgen/tree/merge-cached-validation, in case you want to run any tests of your own on it.

Remaining work:

  • Finish auditing all of the uses of std::string_view vs. std::string. Most of the time we can avoid copying the strings in validation because they either come from a peg::ast parse tree which owns the memory or they are hardcoded literals built into the schema representation in the generated code. But there are a few cases where we build or alter the string, and it needs to be kept alive as a std::string without being implicitly converted to std::string_view and losing the temporary variable.
  • Merge the latest changes from this PR into my copy of that branch, and retake the measurements.
  • Merge/re-base some of the other optimizations from this branch, e.g. around result/error handling. Those should only apply to query resolution, not validation, since there are no more references to the graphql::response namespace except the variable value visitor.
  • Try splitting introspection support (what gets shut off with --no-introspection) into a separate library to reduce code size when it's not used. This shouldn't affect heap, but it may save file space and overall memory consumption.

@barbieri
Copy link
Contributor Author

@wravery that's okay, just be sure to compare against my latest version, since you reported massif against this PR as 128.2. Notice that was running the sample_nointrospection rather than sample.

As for string vs. string_view, take a closer look and let me know. I did review them extensively; aside from ResolverParams, everything else was basically private and never changed. I don't see any reason to modify the fieldName. Also, for the errorPath, notice that since I changed the list processing, it no longer pushes and modifies a per-item resolver -- instead it creates the per-item wrapper (as before) with the new ownErrorPath. This way we avoid the queue/list and build an implicit list referencing the parent SelectionSetParams; the errorPath is then generated on demand.
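The implicit error-path list could look roughly like this. This is a minimal sketch of the idea, not the PR's actual code: SelectionSetParams here only models the parent pointer plus the per-item ownErrorPath segment, and buildErrorPath is a hypothetical helper.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <variant>
#include <vector>

// Each wrapper stores only its own path segment plus a pointer to its
// parent's params, forming an implicit linked list up to the root.
struct SelectionSetParams
{
    const SelectionSetParams* parent = nullptr; // nullptr at the root
    std::variant<std::string, std::size_t> ownErrorPath {};
};

// Materialize the full error path root-first, only when an error occurs.
std::vector<std::variant<std::string, std::size_t>> buildErrorPath(
    const SelectionSetParams& params)
{
    std::vector<std::variant<std::string, std::size_t>> path;

    for (const auto* node = &params; node != nullptr; node = node->parent)
    {
        path.push_back(node->ownErrorPath);
    }

    std::reverse(path.begin(), path.end());
    return path;
}
```

Because each wrapper only borrows its parent, nothing is pushed to or popped from a shared queue per item; the vector is only built for the rare error case.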

Meanwhile I got sidetracked by the ast stuff; since I had never worked with PEGTL it took me a while, but https://github.com/profusion/cppgraphqlgen/compare/cached-validation...profusion:parser-tweaks is evolving. What is left is a way to cache the ast_node, similar to https://github.com/taocpp/PEGTL/blob/master/src/example/pegtl/parse_tree.cpp#L149-L157 (though that example is not thread-safe). That branch also offers a parseExecutableString() (which skips everything but executable_definition) and a sample that outputs parser information, including tracing the query and emitting a graphviz-dot chart, which I find helpful.

Do you know of some simple way to get this memory arena/allocator pool done? Currently PEGTL generates A LOT of useless nodes; every node is first built only to be discarded later, which puts too much pressure on the memory allocator.

So far, even without the arena/pool, it shaves around 10 KB off the peak:

    KB
118.5^ #                                                                      
     | #   :::::     :::::  ::::::@@@:@@:         ::::::::::@@:::  :: :::     
     | #:::: : ::::::: :::::::  ::@  :@ ::::::::::: : :: : :@ :: :::::: :::: :
     | #: :: : :: : :: :::: ::  ::@  :@ :: :::: ::: : :: : :@ :: :::::: :: :::
     | #: :: : :: : :: :::: ::  ::@  :@ :: :::: ::: : :: : :@ :: :::::: :: :::
     | #: :: : :: : :: :::: ::  ::@  :@ :: :::: ::: : :: : :@ :: :::::: :: :::
     | #: :: : :: : :: :::: ::  ::@  :@ :: :::: ::: : :: : :@ :: :::::: :: :::
     | #: :: : :: : :: :::: ::  ::@  :@ :: :::: ::: : :: : :@ :: :::::: :: :::
     | #: :: : :: : :: :::: ::  ::@  :@ :: :::: ::: : :: : :@ :: :::::: :: :::
     | #: :: : :: : :: :::: ::  ::@  :@ :: :::: ::: : :: : :@ :: :::::: :: :::
     | #: :: : :: : :: :::: ::  ::@  :@ :: :::: ::: : :: : :@ :: :::::: :: :::
     | #: :: : :: : :: :::: ::  ::@  :@ :: :::: ::: : :: : :@ :: :::::: :: :::
     | #: :: : :: : :: :::: ::  ::@  :@ :: :::: ::: : :: : :@ :: :::::: :: :::
     | #: :: : :: : :: :::: ::  ::@  :@ :: :::: ::: : :: : :@ :: :::::: :: :::
     | #: :: : :: : :: :::: ::  ::@  :@ :: :::: ::: : :: : :@ :: :::::: :: :::
     | #: :: : :: : :: :::: ::  ::@  :@ :: :::: ::: : :: : :@ :: :::::: :: :::
     | #: :: : :: : :: :::: ::  ::@  :@ :: :::: ::: : :: : :@ :: :::::: :: :::
     | #: :: : :: : :: :::: ::  ::@  :@ :: :::: ::: : :: : :@ :: :::::: :: :::
     | #: :: : :: : :: :::: ::  ::@  :@ :: :::: ::: : :: : :@ :: :::::: :: :::
     | #: :: : :: : :: :::: ::  ::@  :@ :: :::: ::: : :: : :@ :: :::::: :: :::
   0 +----------------------------------------------------------------------->Mi
     0                                                                   297.0

Number of snapshots: 50
 Detailed snapshots: [1 (peak), 19, 21, 36]

@wravery
Copy link
Contributor

wravery commented Dec 12, 2020

Notice that was running the sample_nointrospection rather than sample.

Got it. I made a separate benchmark_nointrospection executable as well, and it doesn't reach your level of optimization yet, but it does help a little. After compressing response::Value I get this for benchmark:

    KB
165.1^ #
     | #  @            :     :::    : : :         :: ::: ::  ::     :     ::
     | #@@@  :     : : ::   :: ::: :: ::: :: :::  :: ::  ::  ::: : :::: @ :::
     | #@ @::::::::::::::::::: ::::::::::::::: ::::@@:: :::@ :::::@:::::@:::::
     | #@ @: ::::: :::::::: :: ::::::::::::::: ::::@ :: :::@::::::@:::::@:::::
     | #@ @: ::::: :::::::: :: ::::::::::::::: ::::@ :: :::@::::::@:::::@:::::
     | #@ @: ::::: :::::::: :: ::::::::::::::: ::::@ :: :::@::::::@:::::@:::::
     | #@ @: ::::: :::::::: :: ::::::::::::::: ::::@ :: :::@::::::@:::::@:::::
     | #@ @: ::::: :::::::: :: ::::::::::::::: ::::@ :: :::@::::::@:::::@:::::
     | #@ @: ::::: :::::::: :: ::::::::::::::: ::::@ :: :::@::::::@:::::@:::::
     | #@ @: ::::: :::::::: :: ::::::::::::::: ::::@ :: :::@::::::@:::::@:::::
     | #@ @: ::::: :::::::: :: ::::::::::::::: ::::@ :: :::@::::::@:::::@:::::
     | #@ @: ::::: :::::::: :: ::::::::::::::: ::::@ :: :::@::::::@:::::@:::::
     | #@ @: ::::: :::::::: :: ::::::::::::::: ::::@ :: :::@::::::@:::::@:::::
     | #@ @: ::::: :::::::: :: ::::::::::::::: ::::@ :: :::@::::::@:::::@:::::
     | #@ @: ::::: :::::::: :: ::::::::::::::: ::::@ :: :::@::::::@:::::@:::::
     | #@ @: ::::: :::::::: :: ::::::::::::::: ::::@ :: :::@::::::@:::::@:::::
     | #@ @: ::::: :::::::: :: ::::::::::::::: ::::@ :: :::@::::::@:::::@:::::
     | #@ @: ::::: :::::::: :: ::::::::::::::: ::::@ :: :::@::::::@:::::@:::::
     | #@ @: ::::: :::::::: :: ::::::::::::::: ::::@ :: :::@::::::@:::::@:::::
   0 +----------------------------------------------------------------------->Mi
     0                                                                   365.5

And this is benchmark_nointrospection:

    KB
159.6^ #
     | #:: :: ::                     :          ::  ::    @           :  :  :
     | #: :: :: :     ::: @@@ :::  ::::: :  :::::   :  :  @ :: :    :::  :  :
     | #: :: :: ::::: :: :@@ :: :::: :: :::::: :: ::: ::::@:: ::@:::: :::::::@
     | #: :: :: ::: :::: :@@ :: :::: :: ::: :: :: ::: ::::@:: ::@:::: :::::::@
     | #: :: :: ::: :::: :@@ :: :::: :: ::: :: :: ::: ::::@:: ::@:::: :::::::@
     | #: :: :: ::: :::: :@@ :: :::: :: ::: :: :: ::: ::::@:: ::@:::: :::::::@
     | #: :: :: ::: :::: :@@ :: :::: :: ::: :: :: ::: ::::@:: ::@:::: :::::::@
     | #: :: :: ::: :::: :@@ :: :::: :: ::: :: :: ::: ::::@:: ::@:::: :::::::@
     | #: :: :: ::: :::: :@@ :: :::: :: ::: :: :: ::: ::::@:: ::@:::: :::::::@
     | #: :: :: ::: :::: :@@ :: :::: :: ::: :: :: ::: ::::@:: ::@:::: :::::::@
     | #: :: :: ::: :::: :@@ :: :::: :: ::: :: :: ::: ::::@:: ::@:::: :::::::@
     | #: :: :: ::: :::: :@@ :: :::: :: ::: :: :: ::: ::::@:: ::@:::: :::::::@
     | #: :: :: ::: :::: :@@ :: :::: :: ::: :: :: ::: ::::@:: ::@:::: :::::::@
     | #: :: :: ::: :::: :@@ :: :::: :: ::: :: :: ::: ::::@:: ::@:::: :::::::@
     | #: :: :: ::: :::: :@@ :: :::: :: ::: :: :: ::: ::::@:: ::@:::: :::::::@
     | #: :: :: ::: :::: :@@ :: :::: :: ::: :: :: ::: ::::@:: ::@:::: :::::::@
     | #: :: :: ::: :::: :@@ :: :::: :: ::: :: :: ::: ::::@:: ::@:::: :::::::@
     | #: :: :: ::: :::: :@@ :: :::: :: ::: :: :: ::: ::::@:: ::@:::: :::::::@
     | #: :: :: ::: :::: :@@ :: :::: :: ::: :: :: ::: ::::@:: ::@:::: :::::::@
   0 +----------------------------------------------------------------------->Mi
     0                                                                   343.6

Almost all of the savings apply with or without --no-introspection, but I should be more in line with your results after I pull in more of your optimizations.

Meanwhile I got sidetracked by the ast stuff; since I had never worked with PEGTL it took me a while, but profusion/cppgraphqlgen@cached-validation...profusion:parser-tweaks is evolving. What is left is a way to cache the ast_node, similar to https://github.com/taocpp/PEGTL/blob/master/src/example/pegtl/parse_tree.cpp#L149-L157 (though that example is not thread-safe). That branch also offers a parseExecutableString() (which skips everything but executable_definition) and a sample that outputs parser information, including tracing the query and emitting a graphviz-dot chart, which I find helpful.

Interesting, so it looks like this is meant to ignore the schema definition part of the grammar, correct? That makes sense for most purposes, and the same optimization could be used in reverse for the schema generator. However, the validation section of the spec specifically mentions rejecting documents for execution if they have any non-executable elements (and vice versa for schema definitions IIRC). This will convert an error about that specifically into a parser error. If you split the document rules into separate executable and schema documents, no single document should satisfy both, but it might be nice to add a handler for the parse error which checks to see if it matches the unified document grammar and converts that to the same error message about how they shouldn't be mixed together. It should even be possible to parse the grammar and see if it matches without executing any of the actions, so parsing against the unified document rule in the fallback would not need to construct an ast at all.

I would also swap the meaning of the two parse* methods; the one which parses a schema document should only be used internally by schemagen. It could even be pulled out of the main header and defined inside of SchemaGenerator.*. The schemagen tool doesn't need to be faster, so it could just keep using the full document grammar without defining a separate schema_document rule or fallback logic for the parse errors on a sub-grammar.

Do you know of some simple way to get this memory arena/allocator pool done? Currently the PEGTL generates A LOT of useless nodes, every node is first built to later be discarded, which causes too much pressure on the memory allocator.

I think you can still inherit from parse_tree::basic_node<ast_node>; it just needs to implement the operator new and operator delete overrides. I think most of the 10KB savings are coming from switching unescaped to a string_view, is that right? To plug those into a mempool, I think you'd want a regular node type which implements everything normally, then create an inherited type that implements the operator new/delete overrides and define the parse tree in terms of the custom-allocated type. The mempool would be defined as a container of the base type so it can manage its own memory, and the parse tree would allocate into the container through the overrides on the sub-type.

For simplicity, high throughput, and data locality, I would suggest using a std::deque for the node mempool:

As opposed to std::vector, the elements of a deque are not stored contiguously: typical implementations use a sequence of individually allocated fixed-size arrays, with additional bookkeeping, which means indexed access to deque must perform two pointer dereferences, compared to vector's indexed access which performs only one.

The storage of a deque is automatically expanded and contracted as needed. Expansion of a deque is cheaper than the expansion of a std::vector because it does not involve copying of the existing elements to a new memory location. On the other hand, deques typically have large minimal memory cost; a deque holding just one element has to allocate its full internal array (e.g. 8 times the object size on 64-bit libstdc++; 16 times the object size or 4096 bytes, whichever is larger, on 64-bit libc++).

The tradeoff is allocating a few extra nodes (based on the chunk size) to round up, but if they're empty that's not a lot, and it will greatly decrease the number of calls to the allocator in addition to making most adjacent nodes fit together in a cache line.

Since each individual parse is single-threaded, but multiple parses might happen concurrently on different threads, you might try adding the thread_local storage specifier to a static mempool container to get thread-safety without needing any locks. The last time I needed it was on a C++11 project, when thread_local was new and not fully supported yet on our toolchain. But now that cppgraphqlgen requires C++17, the compilers which support that should support thread_local as well.
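A minimal sketch of that combination: a thread_local, deque-backed pool with a free-list, hooked up through operator new/delete overrides. PooledNode and the 64-byte chunk size are illustrative assumptions, not cppgraphqlgen code.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Illustrative node type: operator new takes slots from a thread_local
// std::deque (stable addresses, chunked allocation), and a free-list
// recycles released slots instead of returning them to the allocator.
struct PooledNode
{
    int payload = 0;

    static void* operator new(std::size_t)
    {
        auto& recycled = freeList();

        if (!recycled.empty())
        {
            void* slot = recycled.back();
            recycled.pop_back();
            return slot;
        }

        // deque growth never moves existing elements, so the address stays valid
        return pool().emplace_back().bytes;
    }

    static void operator delete(void* slot)
    {
        freeList().push_back(slot); // park the slot for reuse
    }

private:
    struct Chunk
    {
        alignas(std::max_align_t) unsigned char bytes[64]; // big enough for this node
    };

    static std::deque<Chunk>& pool()
    {
        static thread_local std::deque<Chunk> instance; // per-thread, lock-free
        return instance;
    }

    static std::vector<void*>& freeList()
    {
        static thread_local std::vector<void*> instance;
        return instance;
    }
};
```

Deleting a node just parks its slot, so the next parse on the same thread reuses the memory without touching the system allocator.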

@wravery
Copy link
Contributor

wravery commented Dec 12, 2020

More random thoughts about the mempool:

  • The custom allocator sub-type should also store its index in the container, so perhaps it should include the ast node as a member instead of inheriting. To avoid shifting indices on delete, it should also leave itself (emptied of any state) in the container and add that index to a free-list/set for re-use.
  • It may make sense to either clean up the container and free-list after each parse, or to keep them around and reuse the allocations in the next parse. If it doesn't deterministically clean up after each parse, it ought to have a mechanism to explicitly clean up the container; otherwise it won't recover the leftover memory from exceptionally large parse trees. This might be a good place to add an overload, so if you call the version without a mempool it declares a per-parse mempool as a local variable and uses that; otherwise it uses the one you passed in, and you can keep it alive however long you want.

@wravery
Copy link
Contributor

wravery commented Dec 12, 2020

As for the string x string_view, take a closer look and let me know. I did review them extensively, aside from ResolverParams everything else was basically private and never changed.

There are a lot of ways to misuse string_view and introduce memory errors, I hit a few of them myself even in my merge of this PR, so I'm cautious about taking a sweeping change to replace string with string_view.

Generally, replacing string with string_view is safe and effective, as long as you can guarantee the string_view will not outlive the buffer it's pointing to. It's really better suited in that respect to short-lived variables/parameters which go out of scope and are destroyed, but you can kind of cheat and use them indefinitely with hardcoded string literals, since those are in the code/data segments of the executable and will never be freed. You can also use them in a bigger scope (e.g. as members) pointing to heap allocations, as long as you can guarantee that the lifetime of the string_view starts and ends within the lifetime of the heap buffer, e.g. when operating on a parse tree which out-lives the operation you're performing. All of this is pointing to arguments for writing everything in Rust. 😆 The good news is C++ static analyzers are constantly improving and can help detect potential issues like this in the future, but it's unlikely they'll ever be quite as safe as the Rust lifetime checker.

Couple of other points about std::string_view:

  • They should always be passed/returned by value. They're already effectively a pair of pointers, so you don't need the indirection of a reference or pointer to pass (or return) them efficiently on the stack. The only time you would need a reference is if you want to manipulate a string_view elsewhere, e.g. as an out-param or as a mutable member, but most of the time you should use a return type rather than an out-param.
  • Using const std::string_view is OK for local variables, just like it's good to make anything that doesn't change const, but since they're typically only passed/returned by value you should probably not declare a string_view parameter or return type as const because that puts needless constraints on the implementation. The exception would be a const std::string_view& which points to volatile but locally immutable data stored elsewhere.
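As a tiny illustration of the by-value and lifetime rules above (the function is hypothetical, not from this PR):

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Take and return string_view by value; the result borrows the caller's
// buffer, so it is only valid while that buffer stays alive.
std::string_view firstWord(std::string_view text)
{
    const auto space = text.find(' ');
    return space == std::string_view::npos ? text : text.substr(0, space);
}
```

The dangerous case is the reverse: returning a view of a local std::string, or keeping a view of a temporary, leaves the view dangling once the owner is destroyed.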

@wravery wravery mentioned this pull request Dec 14, 2020
@wravery wravery closed this in #133 Dec 14, 2020
@wravery
Copy link
Contributor

wravery commented Dec 14, 2020

For simplicity, high throughput, and data locality, I would suggest using a std::deque for the node mempool

Never mind, this was not as efficient as I hoped. The overall memory usage is higher and it's a little slower. I was able to plug in a std::array based cache which just holds on to pointers returned by the default allocator like in the PEGTL sample, and it seems to work fine with thread_local. But so far I'm not really seeing any measurable improvement from that, it may need to be bigger to make a difference (I went with 32 entries as in the sample), but then it increases the amount of memory we hold on to.
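That std::array pointer cache could be sketched like this, following the shape of the PEGTL sample. CachedNode and the 32-entry size are illustrative; this is not the actual branch's code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdlib>

// operator delete parks up to 32 pointers in a thread_local array, and
// operator new reuses them before calling malloc again.
struct CachedNode
{
    int payload = 0;

    static void* operator new(std::size_t size)
    {
        auto& cache = slots();

        if (cache.used > 0)
        {
            return cache.ptrs[--cache.used]; // reuse a parked allocation
        }

        return std::malloc(size); // allocation-failure handling omitted in this sketch
    }

    static void operator delete(void* ptr)
    {
        auto& cache = slots();

        if (cache.used < cache.ptrs.size())
        {
            cache.ptrs[cache.used++] = ptr; // park instead of freeing
            return;
        }

        std::free(ptr); // cache full, really release it
    }

private:
    struct Cache
    {
        std::array<void*, 32> ptrs {}; // 32 entries, as in the PEGTL sample
        std::size_t used = 0;
    };

    static Cache& slots()
    {
        static thread_local Cache instance; // thread-safe without locks
        return instance;
    }
};
```
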

@barbieri
Copy link
Contributor Author

For simplicity, high throughput, and data locality, I would suggest using a std::deque for the node mempool

Never mind, this was not as efficient as I hoped. The overall memory usage is higher and it's a little slower. I was able to plug in a std::array based cache which just holds on to pointers returned by the default allocator like in the PEGTL sample, and it seems to work fine with thread_local. But so far I'm not really seeing any measurable improvement from that, it may need to be bigger to make a difference (I went with 32 entries as in the sample), but then it increases the amount of memory we hold on to.

I tested that hack as well, but only quickly, without tuning the number of entries. What I found earlier from some print statements is that the quick alloc-dealloc cycles were hitting the malloc cache, so they were not going to the kernel.

However, that reuse only started paying off once I reduced the ast_node size in my version (removing some unused fields, such as end). Making source a string_view also helped a bit, and for performance it helped to compare typeid().hash_code() values before comparing the demangled results (I don't have those exact reports at hand, but it was something I noticed during development).
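The hash_code shortcut can be sketched as a cheap pre-check before the expensive comparison. This is a hypothetical helper with made-up node tag types; it falls back to type_info equality where the original compared demangled names.

```cpp
#include <cassert>
#include <typeinfo>

// Illustrative node tag types, standing in for the parser's node kinds.
struct FieldNode {};
struct FragmentNode {};

// Compare the cheap integer hash first; only fall through to the
// (potentially expensive) full comparison when the hashes match,
// which guards against hash collisions between distinct types.
template <typename T>
bool isNodeType(const std::type_info& nodeType)
{
    const std::type_info& wanted = typeid(T);

    return nodeType.hash_code() == wanted.hash_code() && nodeType == wanted;
}
```
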
