@missinglink missinglink commented Sep 10, 2025

this PR sets the default Douglas-Peucker simplification tolerance for both the geometry and shard tables to 0.0001

in practice this value is so small that the simplified shapes are visually indistinguishable from the originals, but it has the benefit of reducing the database size by up to 75% 😱

see: https://www.seabre.com/simplify-geometry/

cc/ @Joxit I think this may be preferable to updating the shard complexity, or maybe we do both 🤔
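For context, a minimal sketch of what a 0.0001° Ramer-Douglas-Peucker tolerance does to a polygon, using @turf/turf purely as an illustration — this is not the module the PR configures, and the sample polygon is made up:

```js
// Illustration only (not the project's tweak_module_geometry_simplify implementation):
// Ramer-Douglas-Peucker simplification at a 0.0001-degree tolerance via @turf/turf.
import { simplify } from '@turf/turf'

const feature = {
  type: 'Feature',
  properties: {},
  geometry: {
    type: 'Polygon',
    coordinates: [[
      [174.0, -41.0], [174.0005, -41.00005], [174.001, -41.0],
      [174.001, -41.001], [174.0, -41.001], [174.0, -41.0]
    ]]
  }
}

// tolerance is in coordinate units (degrees here); 0.0001° is roughly 11m at the equator
const simplified = simplify(feature, { tolerance: 0.0001, highQuality: false })

// the vertex that deviates by only ~0.00005° from its neighbours is dropped
console.log(feature.geometry.coordinates[0].length, '->', simplified.geometry.coordinates[0].length)
```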

@missinglink missinglink force-pushed the light-geometry-simplification branch from d2af7c0 to 3f0cbc1 on September 10, 2025 05:17
@missinglink missinglink merged commit 55c3660 into master Sep 10, 2025
1 check passed
@missinglink missinglink deleted the light-geometry-simplification branch September 10, 2025 05:45

Joxit commented Sep 12, 2025

Hi there, I wanted to see how interesting this PR is at the performance level. I'm a bit surprised by the results; I think I missed something. Could you run some tests with your tools too?

Based on #106 (comment)

Database sizes

SELECT COUNT(*) from shard;
| simplify | size | shards |
| --- | --- | --- |
| 0.0 | 6079M | 722390 |
| 0.0001 | 4366M | 722282 |

We save almost 28% disk space by removing only ~0.015% of the shards xD

Number of shards per geometry

SELECT AVG(count), MIN(count), stats_median(count), stats_p95(count), stats_p99(count), MAX(count) FROM (SELECT COUNT(*) AS count FROM shard GROUP BY source, id);
| simplify | avg | min | 50th | 95th | 99th | max |
| --- | --- | --- | --- | --- | --- | --- |
| 0.0 | 1.72975341573561 | 1 | 1.0 | 4.0 | 12.0 | 2194 |
| 0.0001 | 1.72949481114682 | 1 | 1.0 | 4.0 | 12.0 | 2188 |

Almost the same here

SELECT AVG(count), MIN(count), stats_median(count), stats_p95(count), stats_p99(count), MAX(count) FROM (SELECT COUNT(*) AS count FROM shard GROUP BY source, id HAVING COUNT(*) > 1);
| simplify | avg | min | 50th | 95th | 99th | max |
| --- | --- | --- | --- | --- | --- | --- |
| 0.0 | 4.70975752264096 | 2 | 3.0 | 12.0 | 35.0 | 2194 |
| 0.0001 | 4.7093007682661 | 2 | 3.0 | 12.0 | 35.0 | 2188 |

Almost the same here

Number of points per shard

SELECT AVG(count), MIN(count), stats_median(count), stats_p95(count), stats_p99(count), MAX(count) FROM (SELECT ST_NPoints(geom) AS count FROM shard);
| simplify | avg | min | 50th | 95th | 99th | max |
| --- | --- | --- | --- | --- | --- | --- |
| 0.0 | 86.1006976840765 | 4 | 87.0 | 181.0 | 195.0 | 199 |
| 0.0001 | 86.0987412118812 | 4 | 87.0 | 181.0 | 195.0 | 199 |

Almost the same here

Gatling Configuration

Since I'm using a French dataset, I query only within French regions. These are the environment variables: USERS_COUNT=100000, USERS_RAMP_TIME=60, SERVER_URL=http://$SPATIAL_IP/query/pip/_view/pelias, and this is the CSV I used (a sketch of how random points could be sampled from these bounding boxes follows the CSV):

Region,LatMin,LatMax,LngMin,LngMax
AUVERGNE-RHONE-ALPES,44.1154,46.804,2.0629,7.1859
BOURGOGNE-FRANCHE-COMTE,46.1559,48.4001,2.8452,7.1435
BRETAGNE,47.278,48.9008,-5.1413,-1.0158
CENTRE-VAL DE LOIRE,46.3471,48.9411,0.053,3.1286
CORSE,41.3336,43.0277,8.5347,9.56
GRAND EST,47.4202,50.1692,3.3833,8.2333
HAUTS-DE-FRANCE,48.8372,51.089,1.3797,4.2557
ILE-DE-FRANCE,48.1205,49.2413,1.4465,3.5587
NORMANDIE,48.1799,50.0722,-1.9485,1.8027
NOUVELLE-AQUITAINE,42.7775,47.1758,-1.7909,2.6116
OCCITANIE,42.3331,45.0467,-0.3272,4.8456
PAYS DE LA LOIRE,46.2664,48.568,-2.6245,0.9167
PROVENCE-ALPES-COTE D'AZUR,42.9818,45.1268,4.2303,7.7188
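
For illustration only (the benchmark above used Gatling, not this script), a k6-style sketch that picks a random point inside one of the bounding boxes above on each request could look like this; only two regions from the CSV are included to keep it short:

```js
// Sketch only: not the Gatling simulation used above. Samples a random point
// inside one of the CSV bounding boxes on every request.
import http from 'k6/http'

const regions = [
  // name, latMin, latMax, lngMin, lngMax (subset of the CSV above)
  { name: 'BRETAGNE', latMin: 47.278, latMax: 48.9008, lngMin: -5.1413, lngMax: -1.0158 },
  { name: 'CORSE', latMin: 41.3336, latMax: 43.0277, lngMin: 8.5347, lngMax: 9.56 }
]

const baseurl = __ENV.SERVER_URL || 'http://localhost:3000/query/pip/_view/pelias'

export default function () {
  const r = regions[Math.floor(Math.random() * regions.length)]
  const lat = r.latMin + Math.random() * (r.latMax - r.latMin)
  const lon = r.lngMin + Math.random() * (r.lngMax - r.lngMin)
  http.get(`${baseurl}/${lon}/${lat}`)
}
```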

Specs

OS: Debian 12.12
CPU: 4 threads (2.3 GHz)
RAM: 15 GB
Arch: linux/amd64
Docker Version: 28.3.2
Container: `pelias/spatial:master-2025-09-10-55c3660793cbc328aeaa98f7966b5572f6174f51`

Results

| simplify | KO | % KO | Cnt/s | Min | 50th | 75th | 95th | 99th | Max | Mean | Std Dev |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.0 | 0 | 0% | 1,639.34 | 10 | 12 | 13 | 18 | 27 | 108 | 13 | 4 |
| 0.0001 | 0 | 0% | 1,587.3 | 12 | 312 | 426 | 856 | 45,264 | 56,841 | 1,429 | 6,542 |

(KO = failed requests; response times in ms)

I don't understand why the results on the simplified polygons are that bad, with a max at 56s 😱. I checked all the stats (CPU/RAM/Network/IO) and I see only one inconsistency: the IO read speed is between 2.3 MiB/s and 8.2 MiB/s with simplify 0.0, but between 350 MiB/s and 500 MiB/s with simplify 0.0001.

I don't know why this is happening. Maybe an issue with indexes? Or my server?

Viewer differences

GET /explore/pip#16/48.788510/2.310809

Slight differences.

(screenshot: simplify-diff)


missinglink commented Sep 15, 2025

@Joxit yeah that's super weird, I can't think of any reason why simplified polygons would be slower 😆

Anyway, as a sanity check I performed the benchmark again. It's fairly basic in that it only tests a single URL, maybe that's the difference? Can you try this method?

git log -1
commit 5d6193c551a68b937cbcb2bbaef289d2fe1ba65b (HEAD -> master, origin/master, origin/HEAD)
Author: Peter Johnson <insomnia@rcpt.at>
Date:   Fri Sep 12 11:29:36 2025 +0200

    improvement(demo): set default simplification to 0

data is the New Zealand extract of WOF (I was looking for something small and fairly representative of real-world data, rather than the ZCTA dataset):

.rw-r--r--@ 426Mi peter 10 Sep 10:52  /data/wof/sqlite/whosonfirst-data-admin-nz-latest.db
322529f0dc8189e950626e5176c34b70304350a7  /data/wof/sqlite/whosonfirst-data-admin-nz-latest.db

generate two databases:

sqlite3 /data/wof/sqlite/whosonfirst-data-admin-nz-latest.db 'SELECT json_extract(body, "$") FROM geojson' | node bin/spatial.js --db=nz.0.0001.spatial.db import whosonfirst --tweak_module_geometry_simplify=0.0001 --tweak_module_shard_simplify=0.0001
sqlite3 /data/wof/sqlite/whosonfirst-data-admin-nz-latest.db 'SELECT json_extract(body, "$") FROM geojson' | node bin/spatial.js --db=nz.0.0.spatial.db import whosonfirst --tweak_module_geometry_simplify=0.0 --tweak_module_shard_simplify=0.0
❯ l nz.0.*
.rw-r--r--@ 425Mi peter 15 Sep 11:19 -I  nz.0.0.spatial.db
.rw-r--r--@ 104Mi peter 15 Sep 11:18 -I  nz.0.0001.spatial.db

run server

node bin/spatial.js server --db=nz.0.0.spatial.db

or

node bin/spatial.js server --db=nz.0.0001.spatial.db

k6 testing

cat load.js
import http from 'k6/http'
// query a single fixed point (central Wellington, NZ) against the Pelias view
const baseurl = 'http://localhost:3000/query/pip/_view/pelias'

export default function () {
  const lon = '174.77607'
  const lat = '-41.28655'
  http.get(`${baseurl}/${lon}/${lat}`)
}
k6 run --vus 20 --iterations 10000 load.js

results show a significant improvement from simplification (the < lines are 0.0 and the > lines are 0.0001)

19,20c19,20
<     http_req_duration.......................................................: avg=6.34ms min=1.23ms med=5.59ms max=55.37ms p(90)=10.52ms p(95)=12.11ms
<       { expected_response:true }............................................: avg=6.34ms min=1.23ms med=5.59ms max=55.37ms p(90)=10.52ms p(95)=12.11ms
---
>     http_req_duration.......................................................: avg=3.29ms min=641µs    med=2.33ms max=37.26ms p(90)=6.74ms p(95)=8.94ms
>       { expected_response:true }............................................: avg=3.29ms min=641µs    med=2.33ms max=37.26ms p(90)=6.74ms p(95)=8.94ms
22c22
<     http_reqs...............................................................: 10000  3121.513659/s
---
>     http_reqs...............................................................: 10000  5972.54897/s
25,26c25,26
<     iteration_duration......................................................: avg=6.38ms min=1.25ms med=5.64ms max=55.41ms p(90)=10.58ms p(95)=12.16ms
<     iterations..............................................................: 10000  3121.513659/s
---
>     iteration_duration......................................................: avg=3.33ms min=669.37µs med=2.36ms max=37.29ms p(90)=6.82ms p(95)=9.03ms
>     iterations..............................................................: 10000  5972.54897/s
31,32c31,32
<     data_received...........................................................: 11 MB  3.4 MB/s
<     data_sent...............................................................: 1.1 MB 350 kB/s
---
>     data_received...........................................................: 11 MB  6.5 MB/s
>     data_sent...............................................................: 1.1 MB 669 kB/s

maybe my method is flawed? can you try to reproduce this?

@missinglink

Can you please check that you're using the /query/pip/_view/pelias endpoint in both cases?

In the past this Pelias view (which is backwards-compatible with wof-admin-lookup) was much slower; I'm working on getting it closer to the performance of /query/pip.
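
As a sanity guard for future runs, one option is to keep the path fixed in the load script and switch only the target server between runs. This is just a sketch; the two ports for the 0.0 and 0.0001 builds are assumptions:

```js
// Sketch: same endpoint path for every run, only the target server changes via -e, e.g.
//   k6 run -e SERVER_URL=http://localhost:3000 load.js   # 0.0 database (assumed port)
//   k6 run -e SERVER_URL=http://localhost:3001 load.js   # 0.0001 database (assumed port)
import http from 'k6/http'

const baseurl = `${__ENV.SERVER_URL}/query/pip/_view/pelias`

export default function () {
  http.get(`${baseurl}/174.77607/-41.28655`)
}
```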

@missinglink

Worth mentioning: I actually changed the default from 0.0001 to 0.00003. That probably shouldn't affect this, but just so you're aware.

@missinglink

These changes to the shard counts are far more significant than what you posted. Of course this will depend heavily on how over-detailed the geometries were in the first place, but I'm very surprised by your comment:

> We save almost 28% disk space by removing only ~0.015% of the shards xD

spatialite -silent nz.0.0.spatial.db 'SELECT COUNT(*) FROM shard'
124346

spatialite -silent nz.0.0001.spatial.db 'SELECT COUNT(*) FROM shard'
25212


Joxit commented Sep 15, 2025

> Can you please check that you're using the /query/pip/_view/pelias endpoint in both cases?
>
> In the past this Pelias view (which is backwards-compatible with wof-admin-lookup) was much slower; I'm working on getting it closer to the performance of /query/pip.

HA HA HA HA HA HA, thanks, I messed up this part 🤣 The /query/pip/_view/pelias endpoint was used only on the simplified version, and the other endpoint on the non-simplified one. Sorry, false alarm 😅

I managed to increase the number of users to 90k; when I push the user count higher, the results become inconsistent, so I think that's the breaking point.

The overall stats now make sense: the simplified version is a bit faster. Here are the new results:

| simplify | KO | % KO | Cnt/s | Min | 50th | 75th | 95th | 99th | Max | Mean | Std Dev |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.0 | 0 | 0% | 1,475.41 | 11 | 18 | 23 | 41 | 59 | 923 | 22 | 27 |
| 0.0001 | 0 | 0% | 1,475.41 | 10 | 17 | 21 | 31 | 43 | 265 | 19 | 8 |

(KO = failed requests; response times in ms)

So everything looks fine now 😄

In France, WOF data is already simplified; maybe that's why I don't see a big shard difference?
The simplified database saves only 108 shards, and 95% of the shards have 181 points or fewer 🤷


missinglink commented Sep 16, 2025

@Joxit agh funny, if you get a chance could you please benchmark this PR?

I don't like that the /query/pip/_view/pelias endpoint is so slow and inconsistent in your testing; that PR should hopefully reduce those nasty P99 scores as well as lower latencies across the board.

In order for it to work, the database needs to have a summary table; if you don't have it you can regenerate the database using the latest code or generate it manually.
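
As a quick way to check for that table before benchmarking, a sketch using better-sqlite3 might look like this; the table name 'summary' and the use of better-sqlite3 here are assumptions, not taken from the project code:

```js
// Sketch: check whether the spatial database already has the summary table.
// The table name 'summary' is an assumption based on the comment above.
const Database = require('better-sqlite3')

const db = new Database('nz.0.0001.spatial.db', { readonly: true })
const row = db
  .prepare("SELECT name FROM sqlite_master WHERE type = 'table' AND name = ?")
  .get('summary')

console.log(row ? 'summary table present' : 'summary table missing, regenerate the database')
db.close()
```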

Docker image of that PR branch: https://hub.docker.com/layers/pelias/spatial/pip-pelias-summary-2025-09-12-6089d97b00eb2b9694a382a0fe20d760dec7adea/images/sha256-5ae942e1c54dbfe41f663597660976bb6d5a08effed5453a91867f25390f1bff


Joxit commented Sep 16, 2025

Okay, I will try your PR tomorrow!

I will try to publish a benchmark with 90k users and then try a higher count.

Note to self: don't forget to update the images at build time 😂
