Enable Postgres connection pooling or query connection limit #885

Closed
fabriciomurta opened this issue Dec 7, 2018 · 8 comments · Fixed by #959

Comments

@fabriciomurta

The tool's default connection count (--number-processes) currently seems too high for a default server configuration. Besides, a huge number of simultaneous connections does not always work better than just a few.

The result is that on a default PostgreSQL server setup with, say, a limit of 100 max connections (which can be lower depending on the system; see max_connections), the limit gets exceeded by an unknown amount (the tool does not report it before failing). For example, importing the Canada data obtained from geofabrik.de into PostgreSQL 10 fails like this:

node cache: stored: 311544523(100.00%), storage efficiency: 57.40% (dense blocks: 27056, sparse nodes: 160538556), hit rate: 100.00%
Osm2pgsql failed due to ERROR: Connection to database failed: FATAL:  remaining connection slots are reserved for non-replication superuser connections

osm2pgsql could either pool its connections to the database (throttling as soon as the first one gets denied), use a smaller number of connections, or query the current max_connections from the database and use no more than that (or stay below it by a threshold to avoid conflicts with other running applications), down to a minimum of one (1) connection to the database.

The maximum number of connections configured for a server can be queried with the show max_connections command within the database. Example output from the client:

# show max_connections;
 max_connections
-----------------
 100
(1 row)

(This works at least on PostgreSQL 9 and 10; mine is 10.6, and this one is probably 9.1.)

In this example, although the server had a limit of 100 connections, the application failed with the connection-limit error message above. Setting --number-processes to 20, for example, worked fine.
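
For illustration, a minimal sketch (assuming libpq) of how a tool could automate the query above and cap its worker count. The reserved-slot margin and connections-per-worker figures are assumptions for the sketch, not osm2pgsql's actual values:

```cpp
// Sketch only: read the server's max_connections and derive a safe
// worker count. Compile with: g++ cap_workers.cpp -lpq
#include <libpq-fe.h>
#include <algorithm>
#include <cstdio>
#include <cstdlib>

int main() {
    PGconn *conn = PQconnectdb("dbname=gis");
    if (PQstatus(conn) != CONNECTION_OK) {
        std::fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return 1;
    }

    // SHOW max_connections works on PostgreSQL 9.x and 10 alike and
    // needs no superuser rights.
    PGresult *res = PQexec(conn, "SHOW max_connections");
    int max_conn = 100;  // fall back to the PostgreSQL default
    if (PQresultStatus(res) == PGRES_TUPLES_OK && PQntuples(res) == 1)
        max_conn = std::atoi(PQgetvalue(res, 0, 0));
    PQclear(res);
    PQfinish(conn);

    // Stay below the limit by a threshold (assumed: 10 slots left for
    // other clients and the superuser-reserved pool), then divide by the
    // connections each worker needs -- roughly one per output table.
    int const reserved = 10;         // assumption
    int const conns_per_worker = 4;  // assumption: point/line/polygon/roads
    int workers = std::max(1, (max_conn - reserved) / conns_per_worker);
    std::printf("capping at %d worker processes\n", workers);
    return 0;
}
```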

As a quick fix, a default of 20 would be reasonable for most setups while still allowing a generous level of parallel queries running at once. Whether using the maximum available connections actually helps should be analyzed case by case, so setting a very high connection count is not a good idea, judging by the server's defaults alone.

If the developer team thinks 20 is way too few, choosing 100 would at least be likely to work on default Postgres installations.

@lonvia
Collaborator

lonvia commented Jan 15, 2019

The default number of processes is normally set to the number of processors available in the system. If you got a number larger than 20, it makes me wonder if you have an unusually large system or if something is going wrong with the detection.

Independently of that, your suggestion to cap the default number of processes is still a good one. If concurrency gets too high, Postgres does not fare too well in my experience. I'd probably go for 15. @pnorman, any opinions?

@pnorman
Collaborator

pnorman commented Jan 16, 2019

> The default number of processes is normally set to the number of processors available in the system. If you got a number larger than 20, it makes me wonder if you have an unusually large system or if something is going wrong with the detection.

The number of connections used is about num tables * num processes, so I'd expect 80 connections for 20 processes, which would fit in the defaults. The limit is based on # of threads, not # of cores, so it's pretty easy to exceed 20 with only two sockets.

> Independently of that, your suggestion to cap the default number of processes is still a good one. If concurrency gets too high, Postgres does not fare too well in my experience. I'd probably go for 15. @pnorman, any opinions?

Postgres' parallelism has gotten better with each version, and I would go much higher than 15. It'd depend on hardware, but I'd generally try 75% of hardware threads, keeping in mind that some stages are single-threaded and will not parallelize at all.

The problem that I see is that each process needs a connection for each table instead of one connection for the entire process. This is a real issue with the multi backend, where you might have 30+ tables.
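
To make the connection budget above concrete, a toy calculation; all figures are assumptions taken from the numbers quoted in this thread:

```cpp
// Toy arithmetic for connections ~= num tables * num processes.
#include <cstdio>

int main() {
    int const max_connections = 100;  // PostgreSQL default
    int const pgsql_tables = 4;       // point, line, polygon, roads
    int const multi_tables = 30;      // a large multi-backend setup

    // With the default limit, the process budget per backend is:
    std::printf("pgsql backend: %d processes max\n",
                max_connections / pgsql_tables);  // 25
    std::printf("multi backend: %d processes max\n",
                max_connections / multi_tables);  // 3
    return 0;
}
```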

@lonvia
Collaborator

lonvia commented Jan 16, 2019

On further thought, 15 is still far too many. Parallel execution is used in two places now:

  • parallel sorting and indexing of tables on import
  • processing of pending ways and relations on updates

In both cases the number of parallel threads should not be too high. Indexing requires a lot of memory, and having too many running in parallel actually slows down the process. Updates are generally run in parallel with other load (like rendering) and should therefore also not take up too many CPUs.

I'd rather go for max 4 as the default.
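
A sketch of the capped default described here; the use of std::thread::hardware_concurrency() and the cap of 4 are assumptions about how such a default might be computed, not the actual patch:

```cpp
// Sketch: default worker count = hardware threads, capped at 4.
#include <algorithm>
#include <thread>

unsigned default_num_processes() {
    unsigned hw = std::thread::hardware_concurrency();  // 0 if unknown
    if (hw == 0) hw = 1;
    return std::min(hw, 4u);
}
```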

@fabriciomurta
Author

fabriciomurta commented Jan 16, 2019

In my case the machine has 40 threads (2x 10-core Xeons with hyperthreading). So, how many tables is that? 3 (nodes, ways, relations)? It easily went beyond the PostgreSQL connection limit. I think I limited it to 20 processes with great results. I didn't run exhaustive tests with more and fewer processes, though.

lonvia added a commit to lonvia/osm2pgsql that referenced this issue Jan 16, 2019
The processing stage opens quite a few connections to the
database. This can be avoided when no objects are pending.

Improves on osm2pgsql-dev#885.
@lonvia
Collaborator

lonvia commented Mar 9, 2019

@pnorman Any objections to having the default at 4 threads maximum?

@pnorman
Collaborator

pnorman commented Mar 9, 2019

> @pnorman Any objections to having the default at 4 threads maximum?

I could see that working now that it doesn't impact a normal import.

@anneb

anneb commented Oct 2, 2019

I did some testing with a smaller osm.pbf and the Postgres default max_connections of 100 on a many-core processor.
osm2pgsql always fails if the option --number-processes is greater than 12.

Osm2pgsql failed due to ERROR: Connection to database failed: FATAL: remaining connection slots are reserved for non-replication superuser connections

osm2pgsql version 0.94.0 (64 bit id space)

@lonvia
Collaborator

lonvia commented Oct 2, 2019

Thanks for testing. Version 1.0.0 might do a little bit better as we are now reusing some of the connections. It's still worth finally adding a limit.

lonvia added a commit to lonvia/osm2pgsql that referenced this issue Oct 14, 2019
Parallel processing is used only when indexes are created and
in the second processing stage of updates. In both cases the
number of parallel threads should not be too high. Indexing
requires a lot of memory and having too many running in parallel
actually slows down the process. Updates are generally run in
parallel with other load (like rendering) and should therefore
also not take up too many CPUs.

Fixes osm2pgsql-dev#885.
tomhughes pushed a commit to tomhughes/osm2pgsql that referenced this issue Feb 12, 2020
The processing stage opens quite a few connections to the
database. This can be avoided when no objects are pending.

Improves on osm2pgsql-dev#885.