Problem with query to get changed parent objects #2056
Comments
Do you have an EXPLAIN ANALYZE of the good and bad plan?
An EXPLAIN ANALYZE of the bad plan is naturally not possible short of waiting a couple of hours. Bad plan:
Good plan:
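For reference, the two kinds of plan output can be captured like this. This is a generic sketch: the table and filter are placeholders, not the actual osm2pgsql query. Plain EXPLAIN only plans the query, so it is cheap even when execution would take hours.

```sql
-- Plain EXPLAIN: shows the chosen plan without running the query,
-- so it works even for the pathological (hours-long) bad plan.
EXPLAIN
SELECT id FROM planet_osm_ways WHERE nodes && ARRAY[123456789]::bigint[];

-- EXPLAIN ANALYZE actually executes the query and reports real row
-- counts and timings; BUFFERS adds shared-buffer hit/read statistics.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id FROM planet_osm_ways WHERE nodes && ARRAY[123456789]::bigint[];
```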
Bad plan: https://explain.depesz.com/s/S6gG

On the server getting the bad plan, can you force it to the good plan by overwriting

Can you get an EXPLAIN ANALYZE for just the good plan? The bad plan is materializing the CTE. The documentation for PostgreSQL 14 and 15 states
I guess it can still materialize a CTE just like it can materialize a query plan node if it's required for the plan. Both plans are estimating 20 million rows. This sounds very off, as we're doing batches of 1000, and indicates the problem is likely statistics-related. The HashAggregate node statistics look reasonable as it is <1000. Reality is probably higher than 215 if places are randomly selected. The estimate of 9.6 million rows for the good plan's bitmap index scan seems high. I would expect PostgreSQL is already tracking the univariate statistics for
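Since PostgreSQL 12 the materialization behaviour of a CTE can be controlled explicitly. A minimal sketch of what inlining the CTE would look like, with assumed table and column names rather than the real osm2pgsql query:

```sql
-- NOT MATERIALIZED asks the planner to inline the CTE into the outer
-- query instead of computing and storing it as a separate step. The
-- planner may still materialize it if the plan requires it.
WITH changed_nodes AS NOT MATERIALIZED (
    SELECT unnest(ARRAY[1001, 1002, 1003]::bigint[]) AS id
)
SELECT DISTINCT w.id
  FROM planet_osm_ways w, changed_nodes c
 WHERE w.nodes && ARRAY[c.id];
```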
Sorry if that wasn't clear, but the underlying issue is indeed the bad statistics for GIN indexes on array columns. That's a well-known and long-standing issue. The last time it bit us was when JIT and parallel indexing were introduced (#1045). The problem is that our tables have now reached a size where Postgres resorts to sequential scans instead of using the index. This is bad because forcing Postgres to use an index is much more difficult than preventing it from using an index, JIT, or parallel processing. The ideas above are about how best to do that without breaking things in a slightly different setup. Tried
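Discouraging those planner features, as discussed above, can be done per transaction with standard PostgreSQL settings; a sketch:

```sql
BEGIN;
-- Discourage (not forbid) sequential scans; the planner can still use
-- one if there is no other way to execute the query.
SET LOCAL enable_seqscan = off;
-- Disable JIT compilation and parallel workers for this transaction only.
SET LOCAL jit = off;
SET LOCAL max_parallel_workers_per_gather = 0;
-- ... run the affected query here ...
COMMIT;
```

SET LOCAL keeps the change scoped to the transaction, so other queries on the same connection are unaffected. There is, however, no equivalent switch that forces an index scan, which is why preventing features is easier than forcing one.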
Can the query be updated to
I tried with
The query we had before would sometimes result in really bad performance when the query planner in PostgreSQL decides that it needs to use sequential scans on the ways table. This is due to the statistics being way off. By running the query for each node id on its own, we hopefully trick the query planner into always using an index lookup. See osm2pgsql-dev#2056
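The workaround described in the commit message could look roughly like this. This is a hypothetical sketch, not the merged osm2pgsql code; the function name and table layout are made up for illustration:

```sql
-- Look up parent ways one node id at a time, so each lookup is planned
-- independently as a cheap GIN index scan instead of one large query
-- whose row estimate the planner gets badly wrong.
CREATE OR REPLACE FUNCTION find_parent_ways(node_ids bigint[])
RETURNS SETOF bigint AS $$
DECLARE
    nid bigint;
BEGIN
    FOREACH nid IN ARRAY node_ids LOOP
        RETURN QUERY
            SELECT id FROM planet_osm_ways
             WHERE nodes && ARRAY[nid];
    END LOOP;
END;
$$ LANGUAGE plpgsql STABLE;

-- A way containing several changed nodes is returned once per node,
-- so duplicates are removed when reading the result:
-- SELECT DISTINCT id FROM find_parent_ways(ARRAY[1001, 1002, 1003]);
```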
The workaround is merged. Closing here.
Version 1.9.0 changed the way we find parents of changed objects, i.e. ways that contain a changed node and relations that contain a changed node or way. We used to query the database for each changed object, which was rather wasteful. The new code does this check once for all changed nodes and once for all changed ways.
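A single batched lookup of the kind described above might look like this. This is a hypothetical sketch of the query shape, not the actual osm2pgsql code; it assumes the middle table stores each way's node ids in a `bigint[]` column covered by a GIN index:

```sql
-- One query for all changed nodes: any way whose nodes array overlaps
-- (&&) the list of changed node ids is a parent that must be reprocessed.
SELECT DISTINCT w.id
  FROM planet_osm_ways w
 WHERE w.nodes && ARRAY[1001, 1002, 1003]::bigint[];
```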
I had tested this with various changesets (including huge ones you would never have in real life) and never saw any problems. The performance was always better or at least not worse than before. For large changesets the performance could be an order of magnitude better!
Now we are seeing problems with this approach on one of the OSMF Nominatim servers. Strangely enough, this only happens on one server; the others don't have this problem. We could narrow this down to the server choosing the wrong query plan, involving a full table scan on the ways table. But it seems this only happens sometimes and depends on various PostgreSQL settings and also on the number of node ids we want to find the parents for. So the whole thing seems rather brittle, and even if we get this fixed today for this server, we don't know what will happen tomorrow.
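When the row estimates are that far off, one standard first step is raising the per-column statistics target and re-analyzing. This may not fully fix the known weakness of array-column estimates for GIN-indexed columns, but it is cheap to try; a sketch with assumed names:

```sql
-- Collect more detailed statistics for the array column the GIN index
-- covers (default target is 100; 1000 samples much more data).
ALTER TABLE planet_osm_ways ALTER COLUMN nodes SET STATISTICS 1000;
ANALYZE planet_osm_ways;

-- Inspect what the planner currently believes about the column:
SELECT attname, n_distinct, null_frac
  FROM pg_stats
 WHERE tablename = 'planet_osm_ways';
```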
How do we fix this?
This is the function we are talking about here
Ideas:
The DISTINCT clause seems to influence the planner sometimes. We can do that later.

Code here: https://github.com/openstreetmap/osm2pgsql/compare/master...joto:osm2pgsql:find-parent-deps?expand=1

@pnorman any ideas?