Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make check keeps CPU busy for over 2 hours #12

Closed
strk opened this issue Apr 24, 2020 · 14 comments
Closed

make check keeps CPU busy for over 2 hours #12

strk opened this issue Apr 24, 2020 · 14 comments

Comments

@strk
Copy link
Collaborator

strk commented Apr 24, 2020

I've started make check before lunch break and it was still running and CPU spinning after over 2 hours. I think I'm going to kill it. Terminal output:

PostgreSQL 12.2 (Ubuntu 12.2-2.pgdg18.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0, 64-bit
  Postgis 3.1.0dev - r3.1.0alpha1-63-gb6b33b0dc - 2020-04-24 09:47:45
  scripts 3.1.0dev 3.1.0alpha1-63-gb6b33b0dc
  GEOS: 3.9.0dev-CAPI-1.14.0
  PROJ: 7.0.0

Running tests

 ./resolve_overlap_and_gap ...

And it's sitting there.
Relevant top(1) output (note 25% CPU means 100% of 1 core in my quadcore):

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                       
14583 postgres  20   0  489008 132156 100732 S  25.2  0.8  24:15.38 postgres: 12/main: strk nibio_reg [local] CALL       

Is this expected behaviour ?

@larsop
Copy link
Owner

larsop commented Apr 24, 2020

No it should take less than a minute, but on https://travis-ci.org/github/larsop/resolve-overlap-and-gap it takes around two minutes.

I may have seen this when there is syntax error i the code maybe. You should find the error in Postgres log.

@strk
Copy link
Collaborator Author

strk commented Apr 24, 2020 via email

@larsop
Copy link
Owner

larsop commented Apr 24, 2020

Strage, I am running following 4 different versions (mac,centos (dell,huawei), ubuntu)

  • server 12.2
    POSTGIS="3.0.1 ec2a9aa" [EXTENSION] PGSQL="120" GEOS="3.8.1-CAPI-1.13.3" PROJ="7.0.0" LIBXML="2.9.1" LIBJSON="0.11" LIBPROTOBUF="1.0.2" WAGYU="0.4.3 (Internal)" TOPOLOGY

  • server 11.5
    POSTGIS="2.5.3 r17699" [EXTENSION] PGSQL="110" GEOS="3.8.0dev-CAPI-1.12.0 " PROJ="Rel. 6.2.0, September 1st, 2019" GDAL="GDAL 3.1.0dev-7a9a0f4-dirty, released 2019/99/99" LIBXML="2.9.1" LIBJSON="0.11" TOPOLOGY RASTER

  • server 12.1
    POSTGIS="3.0.0 r17983" [EXTENSION] PGSQL="120" GEOS="3.8.0-CAPI-1.13.1 " PROJ="6.3.0" LIBXML="2.9.10" LIBJSON="0.13.1" LIBPROTOBUF="1.3.2" WAGYU="0.4.3 (Internal)" TOPOLOGY

  • And Travis https://travis-ci.org/github/larsop/resolve-overlap-and-gap
    PostgreSQL 11.7 (Ubuntu 11.7-2.pgdg16.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609, 64-bit

@strk
Copy link
Collaborator Author

strk commented Apr 27, 2020

I'll try again with GEOS-3.8.1 and PostGIS-3.0.1

@strk
Copy link
Collaborator Author

strk commented May 1, 2020

I just tried again:

PostgreSQL 12.2 (Ubuntu 12.2-2.pgdg18.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0, 64-bit
  Postgis 3.0.1 - r3.0.1 - 2020-04-27 13:55:32
  scripts 3.0.1 3.0.1
  GEOS: 3.8.0rc3-CAPI-1.13.1
  PROJ: 7.0.0

Running tests

 ./resolve_overlap_and_gap ...

And there it remains, indefinitely, keeping CPU busy.

It's over 6 minutes now, at time of writing:

=# select now()-backend_start as duration, query from pg_stat_activity where datname = 'nibio_reg';
    duration     |                            query                            
-----------------+-------------------------------------------------------------
 00:06:29.082225 | CALL resolve_overlap_gap_run(                              +
                 | ('test_data.overlap_gap_input_t2','c1','geom',4258,false), +
                 | ('test_topo_t2',0.00001),                                  +
                 | resolve_overlap_data_clean_type_func(                      +
                 | 49,                                                        +
                 | 0,                                                         +
                 | null,                                                      +
                 | 0,                                                         +
                 | 10000,                                                     +
                 | 120,                                                       +
                 | 240,                                                       +
                 | 40,                                                        +
                 | 320                                                        +
                 | )                                                          +
                 | ,5,4);
(1 row)

There must be something else going wrong with the way dataset is created maybe, possibly due to me running multiple times the command ? Any append-always instead of truncate-then-append way to fill dataset ?

@strk
Copy link
Collaborator Author

strk commented May 1, 2020

For reference, Travis runs:

PostgreSQL 11.7 (Ubuntu 11.7-2.pgdg16.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609, 64-bit
  Postgis 2.5.4 - r - 2020-03-02 10:53:05
  scripts 2.5.4 r0
  GEOS: 3.7.1-CAPI-1.11.1 27a5e771

@strk strk mentioned this issue May 1, 2020
@strk
Copy link
Collaborator Author

strk commented May 1, 2020

I'm trying to understand what the test is supposed to do here.
I see it:

  1. Creates a test_data.overlap_gap_input_t1 table containing simple geometries (polygons)
  2. Populates the test_data.overlap_gap_input_t1 table with 428 records - polygons have a max of 3 rings (1.01 on average); a max of 21 vertexes (8.83 on average)
  3. Creates a test_data.overlap_gap_input_t2 table being a selection of 4 of those 428 records in test_data.overlap_gap_input_t1, having at most 1 ring and 15 vertexes.
  4. Calls resolve_overlap_gap_run against the derived table test_data.overlap_gap_input_t2 (total of 51 vertices, organized in 4 single-ring polygons)

The final call (resolve_overlap_gap_run) is the one which hangs my CPU.

I shall note that the tester ALSO creates a third derivation of the table: test_data.overlap_gap_input_t3. In this third derived table, the geometries are transformed to SRID=25833 (the ones in t2 table are in SRID=4258). The table is still unused at the time resolve_overlap_gap_run is invoked.

@strk
Copy link
Collaborator Author

strk commented May 1, 2020

I've simplified the tester script to be as follows:

DROP table test_data.overlap_gap_input_t2;
CREATE table test_data.overlap_gap_input_t2 AS (SELECT * from test_data.overlap_gap_input_t1 WHERE c1 in (633,1233,1231,834));

SET client_min_messages to NOTICE;

-- Call function to resolve overlap and gap in the function in test_data.overlap_gap_input_t1 which we just testet for overlap
CALL resolve_overlap_gap_run(
('test_data.overlap_gap_input_t2','c1','geom',4258,false), -- TYPE resolve_overlap_data_input
('test_topo_t2',0.00001), -- TYPE resolve_overlap_data_topology
resolve_overlap_data_clean_type_func(  -- TYPE resolve_overlap_data_clean
49,  -- if this a polygon  is below this limit it will merge into a neighbour polygon. The area is sqare meter.
0, -- is this is more than zero simply will called with
null, -- _max_average_vertex_length, in meter both for utm and deegrees, this used to avoid running ST_simplifyPreserveTopology for long lines lines with few points
0, -- IF 0 NO CHAKINS WILL BE DONE A big value here make no sense because the number of points will increaes exponential )
10000, --edge that are longer than this value will not be touched by _chaikins_min_degrees and _chaikins_max_degrees
120, -- The angle has to be less this given value, This is used to avoid to touch all angles.
240, -- OR the angle has to be greather than this given value, This is used to avoid to touch all angles
40, -- The angle has to be less this given value, This is used to avoid to touch all angles.
320 -- OR The angle has to be greather than this given value, This is used to avoid to touch all angles
)
,5,4);                      

Among the NOTICE messages I receive I find this:

sql:BUG12-CPU-busy.sql:23: NOTICE: Failed to open all requested connections 5 , reduce to 0 state : 08001 message: could not establish connection detail : FATAL: database "nibio_reg" does not exist hint : context: SQL statement "SELECT dblink_connect(conntions_array[i], connstr)"
PL/pgSQL function execute_parallel(text[],integer,boolean,text) line 57 at PERFORM
SQL statement "SELECT execute_parallel (stmts_final, _max_parallel_jobs,true)"
PL/pgSQL function resolve_overlap_gap_run(resolve_overlap_data_input_type,resolve_overlap_data_topology_type,resolve_overlap_data_clean_type,integer,integer) line 140 at SQL statement

It does seem to me that my setup makes the dblink calls troublesome, and the code somehow does NOT consider that trouble as a sign of ERROR and not even WARNING but just prints a NOTICE and tries again somewhat later ? It woudl explain the never-ending loop...

@strk
Copy link
Collaborator Author

strk commented May 1, 2020

Found the culprit and reported upstream: larsop/postgres_execute_parallel#3

@strk
Copy link
Collaborator Author

strk commented May 1, 2020

And provided a fix for the infinite loop: larsop/postgres_execute_parallel#4

@strk
Copy link
Collaborator Author

strk commented May 1, 2020

So, now that the problem was found, let's see WHY dbconnect cannot connect.
Logs say:

NOTICE: 7 statements to execute in 5 threads using dbname=nibio_reg

This means that the only connection string is dbname=nibio_reg. There is no port, no username, no password. This is my setup, as I rely on libpq environment variables for those parameters, and it looks like those environment variables are not used in the dblink portion of the code.

The execute_parallel function from postgres_execute_parallel project has an optional _user_connstr parameter, which defaults to NULL. The src/main/sql/function_resolve_overlap_gap_run.sql script in this repository does NOT pass that parameter, so the default is used.

The default, as I observed above, is NULL, which triggers an attempt, by the code of execute_parallel to determine connection string from current connection.
The TCP port in particular is attempted to be extracted using inet_server_port(), but that call in my system returns NULL as well:

=# select inet_server_port();
 inet_server_port 
------------------
                 
(1 row)

As a result, there's only the dbname parameter.

@strk
Copy link
Collaborator Author

strk commented May 1, 2020

For the inability to determine connection string I filed larsop/postgres_execute_parallel#5

@strk
Copy link
Collaborator Author

strk commented May 1, 2020

And here's a fix for that last problem: larsop/postgres_execute_parallel#6
With those 2 PRs merged I finally get a completing make check (completed successfully!)

@larsop
Copy link
Owner

larsop commented May 2, 2020

Thats great.

@larsop larsop closed this as completed May 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants