New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Concurrent topology population introduce invalidities #16
Comments
I've read somewhere about deadlocks occurring, @larsop can you explain why this happens ? |
First I do block on cell in content based grid table, since this is a very small table. Basically what I do here is to block all cells that intersects with that area that may be affected the current update. So if anther update at the same time intersects with some of the same cells it will just wait and try later and another job will start instead. My hope with it doing it this way was to avoid to do row locks on very big tables and then totally avoid the next step. In the next step I start to block rows in the topology tables. The reason why I this in many steps is that did not know about the thats it's best to this one single transaction. |
We're trying to find the root cause of the issue here, so increasing the complexity is the wrong way to go. We need to simplify as much as possible. See the description in https://trac.osgeo.org/postgis/ticket/4684#comment:12 -- can you craft a test in THIS repository to use THAT data ? |
OK, I will that. |
Another thing to do would be having the |
Here is sample but it seems to work ok
|
I done update in this commit where I only catch and log topo error when adding lines to the final Topology result. When running this test
Is was 12 lines that was not added with this errors. All these errors was related to stage 3, which means adding lines that makes a connection between lines from different cells added in stage 2. Stage 3 is here running multithreaded.
When running validation (I did not wait for akk results)
If I try to return the command by hand I get the same error
|
One more comment here, where the error happens and how many seems to be kind of random or maybe depend on number of parallel threads. I did a test now with 28 parallel threads instead of 25 and then I got this list of errors. I will also run a test 10 threads to if this reduce the number of errors.
With 10 threads we got 17 errors not 39 errors as with 28 threads , but the the first run with 25 threads we got 11 errors so it seems to more random than connected to number threads. But 10 threads we did not see any deadlocks.
|
On Tue, Jun 23, 2020 at 07:50:15AM -0700, Lars Aksel Opsahl wrote:
Here is sample but it seems to work ok
It's not easy to control the timing of parallel operations.
The perform_execute_parallel function makes no effort to
try at doing this, so we need to use something different.
What we need is a way to ensure both backends query the
the database _before_ updating it.
perform execute_parallel(stmts,2,true);
Given we're using small geometries, can you ensure
the second "thread" (process) is really started
*before* the first one commits its transaction ?
Can add a pg_sleep() be added as part as the "same statement" ?
|
Both the "Side-location conflict" and the (obviously) "Corrupted topology" errors indicate an invalid topology. Is this happening against a topology that passes the ValidateTopology test ? If so, there are two possibilities:
To determine both occurrences it would help to receive a dump of the starting topology. |
Per point 1 above (ValidateTopology failing to catch invalidities) see https://trac.osgeo.org/postgis/ticket/3042 |
Yes, you where right adding a sleep made the test fail and we git this result. ?column? | error | id1 | id2
|
I have now change code to stop after exception and if the number of Topology error are not zero in this commit if setting the parameter for this. This code is running now. The validation is taking log time so I only do validation if get an exception. I will try to dump the database to you as soon as it stops
The input simple data has this size, table_schema | table_name | total_size | data_size | external_size |
No I got a exception and invalid topology when running the test below.
|
In the table test_topo_ar50_t7.ar50_utvikling_flate_no_cut_line_failed there is one entry with this error.
|
On Fri, Jun 26, 2020 at 08:51:06PM -0700, Lars Aksel Opsahl wrote:
> Given we're using small geometries, can you ensure the second "thread" (process) is really started *before* the first one commits its transaction ? Can add a pg_sleep() be added as part as the "same statement" ?
Yes, you where right adding a sleep made the test fail and we git this result.
(Running in one thread no errors)
?column? | error | id1 | id2
------------+------------------+-----+-----
validation | face within face | 1 | 2
validation | face within face | 2 | 1
validation | face within face | 1 | 3
validation | face within face | 2 | 3
```
Great! This will be very useful test for upstream as well
(although upstream will not want to have execute_parallel dependency)
You should be able to further simplify it by using WKT (easier to read
by humans) and drop the SRID (not really needed for this problem).
Now: does the pessimistic row-level locking make this test pass ?
(yes I do understand the data is way too simple to have all corner
cases).
|
I did a test now on ar50 data_set
and the ' select * from topology.ValidateTopology('test_topo_ar50_t7')' returned no rows. after extend the blocking area. I will continue to do more checking here, just confirm that I still have parallel queries running at stage 3. |
Please remember that SOME invalidities are currently NOT caught by ValidateTopology: https://trac.osgeo.org/postgis/ticket/3042 |
Sad to say we did get invalid topologies when running job_type 3 in multithreaded mode after a couple of mere runs. |
I have done some more testing on layer that have two areas that are not spatially connected. When adjusting the blocking area so it's always cover all off one of the spatial areas, we don't get any topo errors. This is tested 3 times now. So the problem is find the correct blocking area , I will do some more testing on this. |
I am now trying to compute blocking area based the input data. |
I have been running 5 times with this and I have not seen any topo error yet. I also did a test with no row locking and only spatial blocking then I got 5 topo errors. So the combination of row level and spatial blocking seems to work now. Sandro has had two comments and I will check out then now
I will also try to a test with only row level locking to see how that works out. |
Here I have tried resolve case 1 og 2. And I have also reduced the spatial blocking area. Row level locking in Postgres seems to be working very well if I lock many thousand rows og 30 threads in parallel. It takes some hours to test this, because just validate it takes more than one hour. |
In this test_add_line_do_row_level_lock.txt file I have done row level like here and the topo errors we had are gone. |
It would be great to attach the testcase to ticket
https://trac.osgeo.org/postgis/ticket/4684
One day we might implement row-level locking in PostGIS Topology
itself...
|
Sure, I will do that. |
While PostGIS Topology is known to not support concurrent population, as described in https://trac.osgeo.org/postgis/timeline?from=2020-05-08T06%3A13%3A24-07%3A00&precision=second, we want to try at working around this problem by ensuring a single transaction will ever only work on the same edges, faces and nodes at the same time.
This ticket is to track such task in the appropriate repository (here).
Work is undergoing about this in the trac_osgeo_org_postgis_ticket_4684 branch
The text was updated successfully, but these errors were encountered: