node-mapnik faster on linux #78

Closed
springmeyer opened this issue Jul 22, 2018 · 11 comments

@springmeyer

springmeyer commented Jul 22, 2018

@millzpaugh is seeing slower results in staging benchmarking, so I'm concerned that some of the vtcomposite code may not be faster than node-mapnik.

So I just ran the latest benchmarks on linux (a stageyprod machine) using vtcomposite @ 695dd3d (with the fixes to the bench in #77).

I'm able to consistently replicate slower results with vtcomposite than with node-mapnik for 3 bench rules using this command (run once with --package node-mapnik and once with --package vtcomposite):

node bench/bench.js --iterations 200 --concurrency 6 --package node-mapnik --compress --mem

node-mapnik

5: tiles completely made of points, overzooming, no properties ... 6667 runs/s (30ms)
6: tiles completely made of points, same zoom, no properties ... 4878 runs/s (41ms)
17: buffer_size 4096 - tiles completely made of polygons, overzooming (2x) and lots of properties ... 1980 runs/s (101ms)
Benchmark peak mem (max_rss, max_heap, max_heap_total):  993.38MB 9.48MB 18.03MB
Benchmark iterations: 200 concurrency: 6

vtcomposite

5: tiles completely made of points, overzooming, no properties ... 4878 runs/s (41ms)
6: tiles completely made of points, same zoom, no properties ... 2439 runs/s (82ms)
17: buffer_size 4096 - tiles completely made of polygons, overzooming (2x) and lots of properties ... 1460 runs/s (137ms)
Benchmark peak mem (max_rss, max_heap, max_heap_total):  563.96MB 8.2MB 13.03MB
Benchmark iterations: 200 concurrency: 6

/cc @artemp @alliecrevier @millzpaugh @joto @ericfischer

@springmeyer

I did some digging here (still on linux). I've noticed a few things:

  • The work at Property mapper #79 does not benefit these particular rules (while it does benefit others)
  • node-mapnik's speed advantage is smaller (or gone) when individual benchmarks are run in isolation (using fe20f22); it is only when all the benchmarks are run together that node-mapnik is predictably faster
  • profiling with perf shows that the bottlenecks are different for node-mapnik and vtcomposite: for some reason longest_match is a more severe bottleneck for vtcomposite

Doing this:

perf record -g node --perf_basic_prof_only_functions bench/bench.js --iterations 500 --concurrency 1 --package vtcomposite --compress --mem --only 'tiles completely made of points, overzooming, no properties'
perf report -g --no-children

Gives this for node-mapnik:

[screenshot: perf report for node-mapnik]

And this for vtcomposite:

[screenshot: perf report for vtcomposite]

@springmeyer

Further digging reveals one potential cause for longest_match being a larger bottleneck for vtcomposite: it is called more times than in node-mapnik.

I found this out by using gdb to count the number of times it is called (which I learned from https://monoinfinito.wordpress.com/2014/11/18/gdb-hit-count-ignoring-breakpoints-for-a-while/):

I ran the benchmark with just one iteration (to keep things sane):

gdb --args `which node` bench/bench.js --iterations 1 --concurrency 1 --package vtcomposite --compress --mem --only 'tiles completely made of points, overzooming, no properties'

I changed the bench.js code like this (to skip the pre-compression of the input tiles by node.js's zlib, so those deflate calls don't pollute the count):

diff --git a/bench/bench.js b/bench/bench.js
index 0da5d22..b6c405d 100644
--- a/bench/bench.js
+++ b/bench/bench.js
@@ -46,7 +46,7 @@ if (rules.length < 1) {
 }
 
 rules.forEach(function(rule) {
-  if(argv.compress){
+  if(false){
     rule.tiles.forEach(function(t){
       const compressedTile = zlib.gzipSync(t.buffer);
       t.buffer = compressedTile;

Then in gdb I set a breakpoint on longest_match:

b longest_match

Then I told gdb to ignore breakpoint 1 for the next 999999999 hits, so it counts the hits without ever stopping:

ignore 1 999999999

Then I ran the program:

r

That reveals that longest_match was called 2535 times:

Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x0000000001273ea0 <longest_match>
	breakpoint already hit 2535 times
	ignore next 999997464 hits

Doing the same thing for node-mapnik:

gdb --args `which node` bench/bench.js --iterations 1 --concurrency 1 --package node-mapnik --compress --mem --only 'tiles completely made of points, overzooming, no properties'

This shows that node-mapnik only called longest_match 1634 times:

Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x0000000001273ea0 <longest_match>
	breakpoint already hit 1634 times
	ignore next 999998365 hits

@springmeyer

springmeyer commented Jul 23, 2018

Okay, this hypothesis is starting to feel solid: longest_match is called more often on vtcomposite-generated data than on node-mapnik-generated data, and that is the source of both the larger longest_match bottleneck and the slower compression times.

I've now found that the higher longest_match call count is also reproducible locally on OS X.

And I've been able to isolate this outside of vtcomposite: I wrote a CLI tool for gzip over at mapbox/gzip-hpp#29 and dumped the uncompressed output from vtcomposite and node-mapnik by adding this to the bench:

diff --git a/bench/bench.js b/bench/bench.js
index ea4fc3f..9175975 100644
--- a/bench/bench.js
+++ b/bench/bench.js
@@ -100,6 +100,7 @@ function runRule(rule, ruleCallback) {
           if (err) {
             throw err;
           }
+          fs.writeFileSync('testcase-vtcomp.mvt',result);
           return done(null,result);
         });
         break;
@@ -126,6 +127,7 @@ function runRule(rule, ruleCallback) {
             if (err) {
               return cb(err);
             }
+            fs.writeFileSync('testcase-mapnik.mvt',result.getData());
 
             let options = {compression:'none'}
             if (rule.options.compress){

Then I ran the CLI from mapbox/gzip-hpp#29 on each of these .mvt files and counted the longest_match hits in lldb like this:

$ lldb -- ./build/gzip ../vtcomposite2/testcase-vtcomp.mvt
(lldb) b longest_match
(lldb) r
(lldb) c -i 999999997
(lldb) breakpoint list

Which outputs this for node-mapnik (running the CLI on testcase-mapnik.mvt):

	1: name = 'longest_match', locations = 1, resolved = 1, hit count = 8375 Options: ignore: 999991623 enabled
	  1.1: where = gzip`longest_match, address = 0x00000001000084d0, resolved, hit count = 8375

And this for vtcomposite:

	1: name = 'longest_match', locations = 1, resolved = 1, hit count = 10805 Options: ignore: 999989193 enabled
	  1.1: where = gzip`longest_match, address = 0x00000001000084d0, resolved, hit count = 10805
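
For a quick, debugger-free cross-check, something like the following minimal zlib sketch (a hypothetical helper, not the gzip-hpp CLI from mapbox/gzip-hpp#29) could compress each dumped .mvt and report the elapsed time and output size; more longest_match work generally shows up as slower compression even when the output sizes are similar:

```cpp
// Sketch only: compress a dumped .mvt with plain zlib and report timing/size,
// to compare testcase-mapnik.mvt against testcase-vtcomp.mvt without lldb.
#include <zlib.h>
#include <chrono>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

int main(int argc, char** argv) {
    if (argc < 2) {
        std::fprintf(stderr, "usage: %s file.mvt\n", argv[0]);
        return 1;
    }
    std::ifstream in{argv[1], std::ios::binary};
    std::string input{std::istreambuf_iterator<char>{in}, std::istreambuf_iterator<char>{}};

    uLongf dest_len = compressBound(static_cast<uLong>(input.size()));
    std::vector<Bytef> dest(dest_len);

    auto start = std::chrono::steady_clock::now();
    int ret = compress2(dest.data(), &dest_len,
                        reinterpret_cast<const Bytef*>(input.data()),
                        static_cast<uLong>(input.size()),
                        Z_DEFAULT_COMPRESSION);
    auto elapsed = std::chrono::steady_clock::now() - start;

    if (ret != Z_OK) {
        std::fprintf(stderr, "compress2 failed: %d\n", ret);
        return 1;
    }
    std::printf("%s: %zu -> %lu bytes in %lld us\n", argv[1], input.size(),
                static_cast<unsigned long>(dest_len),
                static_cast<long long>(
                    std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count()));
    return 0;
}
```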

@springmeyer

springmeyer commented Jul 23, 2018

Aha! Dumping the .mvt files using protoc shows they have a different order of feature fields...

mason/mason install protobuf 3.2.0
./mason_packages/osx-x86_64/protobuf/3.2.0/bin/protoc --decode_raw < testcase-mapnik.mvt > testcase-mapnik.raw.txt
./mason_packages/osx-x86_64/protobuf/3.2.0/bin/protoc --decode_raw < testcase-vtcomp.mvt > testcase-vtcomp.raw.txt

And diff -u testcase-mapnik.raw.txt testcase-vtcomp.raw.txt gives a long diff with a lot of this:

   2 {
     3: 1
-    1: 2000
     4: "\t\240.\370."
+    1: 2000
   }

So node-mapnik features look like:

  2 {
    3: 1
    4: "\t\314\034\250-"
    1: 1
  }

In this order: 3(geometry_type), 4(geometry), 1(feature_id)

While vtcomposite features look like:

  2 {
    3: 1
    1: 1
    4: "\t\314\034\250-"
  }

In this order: 3(geometry_type), 1(feature_id), 4(geometry)

test data at https://gist.github.com/springmeyer/743828e76218b2d2e70a02e3dbd9ff12
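
For illustration only (this is not the actual encoder of either mapnik-vector-tile or vtzero): with protozero's pbf_writer, the field order in the output is simply the order of the add_* calls, so the two layouts observed in the protoc dumps correspond to writers like these. Tag numbers follow the Mapbox Vector Tile spec: 1 = id, 3 = type, 4 = geometry.

```cpp
#include <protozero/pbf_writer.hpp>
#include <cstdint>
#include <string>
#include <vector>

// node-mapnik order: 3 (geometry_type), 4 (geometry), 1 (feature_id)
std::string encode_mapnik_order(uint64_t id, const std::vector<uint32_t>& geometry) {
    std::string data;
    protozero::pbf_writer feature{data};
    feature.add_uint32(3, 1);                                      // type = POINT
    feature.add_packed_uint32(4, geometry.cbegin(), geometry.cend()); // geometry
    feature.add_uint64(1, id);                                     // id last
    return data;
}

// vtcomposite/vtzero order: 3 (geometry_type), 1 (feature_id), 4 (geometry)
std::string encode_vtzero_order(uint64_t id, const std::vector<uint32_t>& geometry) {
    std::string data;
    protozero::pbf_writer feature{data};
    feature.add_uint32(3, 1);                                      // type = POINT
    feature.add_uint64(1, id);                                     // id before geometry
    feature.add_packed_uint32(4, geometry.cbegin(), geometry.cend());
    return data;
}
```

Both orderings encode exactly the same information; only the byte layout differs, which is what changes zlib's match-finding behaviour downstream.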

@springmeyer

springmeyer commented Jul 23, 2018

Wow. The different field order explains the difference in speed between node-mapnik and vtcomposite.

Here is my proof:

I rebuilt node-mapnik (on linux/stageyprod) with this patch applied to mapnik-vt (which makes the field order the same as vtcomposite): https://gist.github.com/springmeyer/407798879f7b552a3af52e7d3ecc81f3

Then I ran the benchmark again and profiled and 💥 the massive longest_match bottleneck now shows up in node-mapnik:

[screenshot: perf report for the patched node-mapnik, now dominated by longest_match]

And the bench shows the difference too:

normal/unmodified node-mapnik:

node bench/bench.js --iterations 500 --concurrency 1 --package node-mapnik --compress --mem --only 'tiles completely made of points, overzooming, no properties'

1: tiles completely made of points, overzooming, no properties ... 1042 runs/s (480ms)
Benchmark peak mem (max_rss, max_heap, max_heap_total):  58.55MB 7.4MB 18.03MB
Benchmark iterations: 500 concurrency: 1

vtcomposite master:

node bench/bench.js --iterations 500 --concurrency 1 --package vtcomposite --compress --mem --only 'tiles completely made of points, overzooming, no properties'

1: tiles completely made of points, overzooming, no properties ... 912 runs/s (548ms)
Benchmark peak mem (max_rss, max_heap, max_heap_total):  44.64MB 7.28MB 13.03MB
Benchmark iterations: 500 concurrency: 1

modified node-mapnik (https://github.com/mapnik/node-mapnik/compare/vtzero-feature-field-order) with crippled performance due to the new gzip bottleneck:

node bench/bench.js --iterations 500 --concurrency 1 --package node-mapnik --compress --mem --only 'tiles completely made of points, overzooming, no properties'

1: tiles completely made of points, overzooming, no properties ... 780 runs/s (641ms)
Benchmark peak mem (max_rss, max_heap, max_heap_total):  58.54MB 7.41MB 18.03MB
Benchmark iterations: 500 concurrency: 1

@joto

joto commented Jul 23, 2018

The only guess I have here is that the different order either makes the data align better in some way or makes better use of CPU caches or so. The longest_match code in zlib is deeply magic and can optionally use an assembler implementation, so it is hard to tell what's going on in there.

If this is the case, it should also depend on the IDs used, whether they fit in a byte or need more. Of course we could change the order of the fields in vtzero, but that could make other cases worse and/or depend on the zlib version and other details. We need many more benchmarks before we could justify this. It might well be that with slightly different data things look different and the benchmarks are misleading here, because the data they use is quirky in some way.

Oh, and there is a reason the feature_id comes first when using vtzero: when filtering by ID, it is faster that way. https://github.com/mapbox/vtzero/blob/master/include/vtzero/layer.hpp#L388-L419 has less work to do.
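
A minimal sketch of that idea, using protozero's pbf_reader directly rather than vtzero's actual code: when the id is written before the geometry, a lookup by id can reject a non-matching feature as soon as its first field is read, without ever having to skip over the (usually much larger) geometry field.

```cpp
// Sketch only (assumes protozero; this is not vtzero's implementation).
// Scan a feature message for its id (tag 1) and compare it against a wanted
// id. If the id field precedes the geometry field, a mismatch is detected
// before the reader touches the geometry bytes at all.
#include <protozero/pbf_reader.hpp>
#include <protozero/types.hpp>
#include <cstdint>

bool feature_matches_id(protozero::data_view feature_data, uint64_t wanted_id) {
    protozero::pbf_reader feature{feature_data};
    while (feature.next()) {
        if (feature.tag() == 1) {   // id field
            return feature.get_uint64() == wanted_id;
        }
        feature.skip();             // skip type/geometry/tag fields
    }
    return false;                   // feature has no id at all
}
```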

@springmeyer

> We need many more benchmarks before we could justify this.

I think @millzpaugh and @alliecrevier may be seeing, in production with a huge range of traffic, a similar pattern to what I saw in my single/local test. But they will need to confirm. If production traffic shows an overall slowdown with vtcomposite, and perf output shows that longest_match is a larger bottleneck, then I think we should consider what it would take to have vtcomposite/vtzero write fields in a different order.

> Of course we could change the order of the fields in vtzero, but that could make other cases worse and/or depend on the zlib version and other details.

Yes, it's possible. Per my first comment, I think @millzpaugh and @alliecrevier should benchmark and profile with perf to compare node-mapnik vs vtcomposite with a lot of varied traffic and see whether the pattern persists.

@springmeyer

> Oh, and there is a reason the feature_id comes first when using vtzero: when filtering by ID, it is faster that way. https://github.com/mapbox/vtzero/blob/master/include/vtzero/layer.hpp#L388-L419 has less work to do.

👌 thanks for pointing that out. Which application has this use case? (Also, I realize this is perhaps better discussed over at /vtzero.)

@millzpaugh

Wow @springmeyer, thank you for such clear documentation throughout your debugging process.

@alliecrevier and I will pull two random samples of compositing traffic from the ELB logs to run again and confirm that longest_match is a bottleneck in vtcomposite.

@millzpaugh millzpaugh added this to WIP in vtcomposite Jul 24, 2018
@millzpaugh

millzpaugh commented Aug 8, 2018

We are sticking with our current compression library, gzip-hpp. Going to close this issue and reopen it if we want to optimize feature field ordering + compression.

@springmeyer

> We are sticking with our current compression library, gzip-hpp. Going to close this issue and reopen it if we want to optimize feature field ordering + compression.

👌 Good call to stick with the current library for now. I will say, however, that I have a strong hunch a speedup awaits us if we can dig into either changing the field order or moving to a faster zlib implementation. But we can handle that in other tickets at a future time.
