Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance Upgrades for Diff-Updates #1433

Closed
benedikt-brandtner-bikemap opened this issue Oct 25, 2022 · 19 comments
Closed

Performance Upgrades for Diff-Updates #1433

benedikt-brandtner-bikemap opened this issue Oct 25, 2022 · 19 comments

Comments

@benedikt-brandtner-bikemap
Copy link
Contributor

When applying diff-updates to a planet-import once a week by merging daily OSM-Change-Files from the past week I encountered very long processing times of up to 10 days.

After going through all refresh functions I identified a few key-issues which I believe should address the general performance of diff-updates across all refresh functions with only minimal cost to import time:

  1. Refactor diff-updates to tables containing merged LineStrings to be ID-Based instead of Tag-Based or even full
  2. Refactor Buildings update from full to partial-diff-update
  3. Refactor Poi-Point updates from full to partial-diff-updates
  4. Refactor Water-Point/Lakeline updates from full-incremental to partial-diff-updates
  5. Refactor Park-Polygon update from full-incremental to partial-diff-update
  6. Adding partial Indices on tables affected by the update matching the update query
  7. Restrict entries in changes-tables to be unique by adding Primary Keys
  8. Restrict Diff-Updates to INSERT and UPDATE operations where applicable
  9. Refactor IN queries to EXISTS queries
  10. Refactor CTEs in refresh-functions to temporary tables with indexes
  11. Analyzing tables before and during diff-updates

I have implemented solutions to this issues except for point 2 in this branch at our fork and would like to ask if you would be interested in merging them back into the main repository.

Environment:

AWS m5.2xlarge EC2-Instance (EC2 not RDS)

  • 8 Cores
  • 32GB RAM

AWS GP3 Volume

  • 1000GB
  • 6000 IOPS
  • 125 mb/s max-throughput
  • EXT4 Filesystem

Operating System and Libraries:

  • Amazon-Linux 2 with Linux-Kernel 5.x
  • PostgreSQL 14
  • PostGIS 3.3.1
  • Proj 4.9.3
  • GEOS 3.9.3
  • GDAL 2.4.4
  • Protobuf 3.14.0
  • Protobuf-C 1.3.3
  • LibKakasi 2.3.6
  • UTF8-Proc 2.5.0
  • Mapnik-German 2.5.9.1
  • PGSQL-gzip 1.0.0

PostgresSQL-Config:

  • jit = 'OFF'
  • shared_buffers = '7424MB'
  • work_mem = '23756kB'
  • maintenance_work_mem = '2GB'
  • effective_io_concurrency = 200
  • max_worker_processes = 8
  • max_parallel_maintenance_workers = 4
  • max_parallel_workers_per_gather = 4
  • max_parallel_workers = 8
  • wal_buffers = '16MB'
  • max_wal_size = '16GB'
  • min_wal_size = '4GB'
  • checkpoint_completion_target = '0.9'
  • random_page_cost = '1.1'
  • effective_cache_size = '22272MB'
  • default_statistics_target = 500
@lazaa32
Copy link
Collaborator

lazaa32 commented Oct 28, 2022

Hi @benedikt-brandtner-bikemap
do you have any numbers to see the impact on execution time? You mentioned processing time up to 10 days on current master, how long did it take on your branch?

@benedikt-brandtner-bikemap
Copy link
Contributor Author

Hi, unfortunately i didn't backup the logs and didn't do a new import / update on the planet db yet.

Will run a new import and update with and and without the changes in our development stage with data from austria later today and will post the results as a diff as soon as it's done.

@benedikt-brandtner-bikemap
Copy link
Contributor Author

Sorry for taking a bit longer but it took a while until i compared enough update runs to have an a better understanding of the impact on the different changes on update execution times with medium sized data sets and larger osm-changes.

Table with execution time comparison as well as links to branches and logs can be found below.

After comparing I have come to revise my initial conclusion on issues to address:

  1. Refactor diff-updates to tables containing merged LineStrings to be ID-Based instead of Tag-Based or even full
  2. Refactor Buildings update from full to partial-diff-update
    • Still at over 8 minutes execution time and now the longest running update function but no solution yet
  3. Refactor Poi-Point updates from full to partial-diff-updates
  4. Refactor Water-Point/Lakeline updates from full-incremental to partial-diff-updates
  5. Refactor Park-Polygon update from full-incremental to partial-diff-update
  6. Adding partial Indices on tables affected by the update matching the update query
    • No positive effect on update execution time could be observed
  7. Restrict entries in changes-tables to be unique by adding Primary Keys
    • Reduces the amount of affected rows during updates
  8. Restrict Diff-Updates to INSERT and UPDATE operations where applicable
    • Reduces the amount of affected rows during updates
  9. Refactor IN queries to EXISTS queries
    • No positive effect on update execution time could be observed
  10. Refactor CTEs in refresh-functions to temporary tables with indexes
    • No positive effect on update execution time could be observed
  11. Analyzing tables before and during diff-updates

Base-Commit
Improved-Commit

Import:

Base Improved Difference Percent
ImpOSM 00:06:33.229 00:06:35.175 +0:00:02 +0.49%
SQL 00:20:45.000 00:21:46.000 +0:01:01 +4.90%
VACUUM / ANALYZE 00:04:11.000 00:04:10.000 -0:00:01 -0.40%

Base.log Improved.log

Update:

Base Improved Difference Percent
ImpOSM DELETE 00:45:19.000 00:46:20.093 +0:01:01 +2.25%
ImpOSM INSERT/UPDATE 00:38:39.000 00:37:56.907 -0:00:42 -1.82%
ImpOSM Generalized 00:09:37.000 00:09:21.048 -0:00:16 -2.76%
SQL with new Features 00:31:07.000 00:21:10.000 -0:09:57 -31.98%
SQL w/o new Features 00:31:07.000 0:18:39 -0:12:27 -40.03%
VACUUM / ANALYZE 00:15:49.000 00:17:16.000 +0:01:27 +9.17%

Base.log Improved.log

Update SQL-Functions:

Base Improved Difference Percent
aerodrome_label.refresh 00:00:00.037 00:00:00.029 -0:00:00.008 -21.98%
buildings.refresh 00:08:03.650 00:08:01.721 -0:00:01.929 -0.40%
housenumber.refresh 00:00:45.459 00:00:54.646 +0:00:09.186 +20.21%
mountain_linestring.refresh 00:00:00.300 00:00:00.723 +0:00:00.424 +141.27%
mountain_peak_point.refresh 00:00:00.485 00:00:00.783 +0:00:00.298 +61.34%
park_polygon.refresh 00:00:00.166
place_city.refresh 00:00:02.850 00:00:02.924 +0:00:00.074 +2.58%
place_country.refresh 00:00:01.167 00:00:01.105 -0:00:00.061 -5.23%
place_island_polygon.refresh 00:00:00.013 00:00:00.022 +0:00:00.010 +75.40%
place_state.refresh 00:00:00.149 00:00:00.139 -0:00:00.010 -6.71%
poi_point.refresh 00:00:13.317 00:00:21.045 +0:00:07.729 +58.04%
poi_polygon.refresh 00:00:08.249 00:00:14.537 +0:00:06.287 +76.22%
transportation_name.refresh_aerialway_linestring 00:02:09.437
transportation_name.refresh_name 00:11:31.794 00:02:46.788 -0:08:45.006 -75.89%
transportation_name.refresh_network 00:04:24.957 00:04:37.733 +0:00:12.776 +4.82%
transportation_name.refresh_shipway_linestring 00:00:20.967
transportation.refresh_z11 00:01:48.517 00:01:02.839 -0:00:45.678 -42.09%
transportation.refresh_z8 00:03:59.393 00:00:14.254 -0:03:45.139 -94.05%
water_name.refresh 00:00:06.543
waterway_important.refresh 00:00:05.670 00:00:13.271 +0:00:07.601 +134.06%

Base.log Improved.log

I hope that gives a little bit more insight on the impact of the proposed changes and also a reason for my proposal.

Cheers

@benedikt-brandtner-bikemap
Copy link
Contributor Author

Sorry to bug you but, is there any interest in these changes (especially the LineString merging) or not all?

greets

@ZeLonewolf
Copy link
Contributor

I think @frodrigo has done the most work recently on the diff updates.

@TomPohys
Copy link
Member

TomPohys commented Dec 5, 2022

The progress on the transportation layer is awesome. I will look into it more deeply. As @frodrigo is the expert on Diffs, his opinion is more than welcome! Thank you for you work and your patience.

@benedikt-brandtner-bikemap
Copy link
Contributor Author

Awesome :)

If I can be of any assistance just let me know

@benedikt-brandtner-bikemap
Copy link
Contributor Author

Hey Happy new Year everyone 🎉 !

Any updates here, should I rebase my changes ontop of the most recent ones?

@TomPohys
Copy link
Member

Happy new year to you too 🎉

Sorry for such a delay and thanks for your patience! It would be great if you could rebase it. Thank you!

@benedikt-brandtner-bikemap
Copy link
Contributor Author

Hey,

have rebased my changes on the current master here.

But just FYI two tests are currently failing in the master branch and are still failing in my rebased branch
image

@TomPohys
Copy link
Member

Thanks a lot! I try to investigate...

@benedikt-brandtner-bikemap
Copy link
Contributor Author

Hey,

have rebased my changes again on the current master here.

greets

@TomPohys
Copy link
Member

You are awesome! I am going to look at it now!

@benedikt-brandtner-bikemap
Copy link
Contributor Author

No problem, thanks a lot :)

@TomPohys
Copy link
Member

Hi @benedikt-brandtner-bikemap, Thank you for your patience. Are you still willing to create PR? It looks good! There are some PRs to be merged, but they should not compromise diffs. Thanks!

@benedikt-brandtner-bikemap
Copy link
Contributor Author

Hey @TomPohys,

no worries thanks for having a look and yes i'd very much like to create the PR 🙂

Should i create the PR as one big PR or should i separate it and create one for each commit?

greets

@TomPohys
Copy link
Member

From just an esthetic perspective it will be better to have separate PR for each layer. But it is up to you. Thanks!

@benedikt-brandtner-bikemap
Copy link
Contributor Author

Hey @TomPohys

took a bit to separate into multiple branches by layerby layer but i finally managed to do it and opened the PRs.

During rebasing I encountered a bug with motorways in zoom-level 4, fixed it and the remaining local test-issues, and created a PR (#1505) which acts as a base for all the other PRs.
So it would be nice if you could also merge it first. As soon as this PR is merged all the other PRs containing the update improvements can be merged independently.

I hope that works for you and if you have any questions just shoot me a message ;)

Greets!

TomPohys pushed a commit that referenced this issue Mar 21, 2023
Improved update performance of aerodrome_label layer
- Refactored IDs to be unique in aerodrome_label.osm_ids
- Restricted updates to INSERT and UPDATE operations during aerodrome_label.refresh
- Added analyze statements before update queries during aerodrome_label.refresh

See #1433 for more detailed discussion.
@TomPohys
Copy link
Member

Thank you for all your work! I think we can close this now.

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants