processing never finishes on small dataset #38

Closed
gplv2 opened this issue Oct 27, 2017 · 5 comments
@gplv2

gplv2 commented Oct 27, 2017

Hi,

I'm having a strange issue with ogr2osm. We are using it to prepare the Belgian open address database shapefile for loading into PostgreSQL with the OSM toolset, in the following manner:

Download of the address list:
https://downloadagiv.blob.core.windows.net/crab-adressenlijst/Shapefile/CRAB_Adressenlijst.zip

It's not too big, IMHO. And this used to work without a glitch a few months ago (it's automated); neither the ogr2osm code nor the wrapper script has changed since then (the data, of course, has).

After extracting the zip, we first run ogr2ogr on it, reprojecting from Belgian Lambert 72 (EPSG:31370) to WGS 84 (EPSG:4326):

/usr/local/bin/ogr2ogr -s_srs EPSG:31370 -t_srs EPSG:4326 CrabAdr_parsed CRAB/Shapefile/CrabAdr.shp -overwrite

That step works; then we run ogr2osm like this:

/usr/local/bin/ogr2osm/ogr2osm.py --idfile=ogr2osm.id --positive-id --saveid=ogr2osm.id CrabAdr_parsed/CrabAdr.shp

This keeps running until the machine reaches this state:

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
30559 glenn     20   0 15.793g 0.015t  26116 R 100.0 60.7  24:56.03 python 

The machine still has lots of memory available:

KiB Mem : 26754664 total,   789720 free, 16448580 used,  9516364 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  3797372 avail Mem 

The memory growth stops at that point, one CPU stays 100% busy, and it never finishes. I left it running for over 18 hours, which is already abnormal, as it used to take less than half an hour. So it hangs.

I can't seem to strace the process either; I've never seen this message before when stracing anything:

root@grb-db-0:/usr/local/src/grb# strace -fp 26878
strace: Process 26878 attached
strace: [ Process PID=26878 runs in x32 mode. ]

and then it stays silent. I've never seen that x32 mode message before, and I've been around Unix for more than 20 years.

Ctrl-C works, however, and shows this:

Traceback (most recent call last):
  File "/usr/local/bin/ogr2osm/ogr2osm.py", line 723, in <module>
    mergePoints()
  File "/usr/local/bin/ogr2osm/ogr2osm.py", line 559, in mergePoints
    parent.replacejwithi(pointsatloc[0], point)
  File "/usr/local/bin/ogr2osm/geom.py", line 65, in replacejwithi
    j.removeparent(self)
  File "/usr/local/bin/ogr2osm/geom.py", line 23, in removeparent
    Geometry.geometries.remove(self)
KeyboardInterrupt

which tells me this happens in the mergePoints() function.

I also tried Python 3.5 instead of 2.7; same symptom.

This is part of an automated tool stack that uses Terraform with Google Cloud to crunch the data; you could reproduce the entire thing by building the exact same machine we are using now from the repository below:

https://github.com/gplv2/crab-osm-qa

The bash script that contains this code is here: https://github.com/gplv2/crab-osm-qa/blob/master/helpers/process_source.sh

Could you give your two cents on this issue, please? I've always had success with the ogr2osm tool; in fact, in the same script we also parse the open Belgian road database, and that passes fine. It's only the address database that shows this behavior.

Would love to get some suggestions at this point. I appreciate this a lot. Thank you for your work as well; it has proven essential for the Belgian OSM community.

Greetings,

Glenn

@gplv2
Author

gplv2 commented Oct 27, 2017

Forgot to mention: the output gets as far as

l.debug("Checking list")

So it must be happening after this message, which of course you can also deduce from the backtrace above, so perhaps this was not needed.

@gplv2
Author

gplv2 commented Oct 28, 2017

I've been debugging this a bit further; Python is not my forte, but I've added some debug statements. It turns out that in mergePoints we have:

Total points user : 3,508,945 (the count of the points variable)
Total points coord: 2,527,003 (the count of the pointcoords variable)

It takes a very long time to process the first 5,000 points, unusually long IMHO:

for (location, pointsatloc) in pointcoords.items():

There are also quite a few duplicates in this dataset, so it has to work hard, but it doesn't make sense that it is this slow. I'll hack on this a bit more to find out where the performance bottleneck is.
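
For reference, those counts can be double-checked outside ogr2osm with a few lines of GDAL/OGR Python. This is only a sketch: the shapefile path matches the commands above, but the 7-digit rounding is an assumption about how coordinates get grouped, not ogr2osm's actual logic.

from collections import Counter
from osgeo import ogr

# Count how many points share the same (rounded) coordinate in the
# reprojected shapefile; the rounding precision is an assumption.
ds = ogr.Open("CrabAdr_parsed/CrabAdr.shp")
layer = ds.GetLayer()

counts = Counter()
for feature in layer:
    geom = feature.GetGeometryRef()
    counts[(round(geom.GetX(), 7), round(geom.GetY(), 7))] += 1

total = sum(counts.values())
print("total points :", total)
print("unique coords:", len(counts))
print("duplicates   :", total - len(counts))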

When we parse the road database, it contains a lot more points :

Merging points
Total points user : 8082689
Making list
Total points coord 8082689

But it seems there are no duplicates there, so it goes really fast according to the debug logging. The memory footprint, however, is exactly the same as when we parse the address data.

@roelderickx
Collaborator

The bottleneck is on line 23 of geom.py, which is only executed for a duplicate node:

Geometry.geometries.remove(self)

When there are few or no duplicates the performance drawback isn't really noticeable, even with millions of nodes. But in this case there are around 1 million duplicates, each of which has to be searched for in a non-hashed list before removal.

On my computer it takes 0.035 seconds on average to process each unique coordinate (whether or not duplicates exist). That may not seem like much, but with 2.5 million unique coordinates the process ends up taking more than 24 hours.

Which brings us to a more important question: why add elements to a list only to remove them later, without ever having used them? I am convinced that mergePoints can be integrated into parseData, which should significantly improve performance.
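
To see why that single line dominates, here is a minimal, self-contained timing sketch (made-up container sizes, not ogr2osm code):

import timeit

N = 1_000_000
geom_list = list(range(N))   # stands in for Geometry.geometries, a plain list
geom_set = set(range(N))     # hashed container, for comparison

def remove_from_list():
    geom_list.remove(N - 1)  # linear scan over up to N elements
    geom_list.append(N - 1)

def remove_from_set():
    geom_set.remove(N - 1)   # hash lookup, O(1) on average
    geom_set.add(N - 1)

print("list:", timeit.timeit(remove_from_list, number=100))
print("set :", timeit.timeit(remove_from_set, number=100))

With roughly a million duplicates each triggering such a scan, this quadratic behavior alone explains a multi-hour runtime.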

@roelderickx
Collaborator

The changes work in roelderickx/ogr2pbf; processing time is down to around 5 minutes. I'll try to backport the changes to ogr2osm and create a pull request.
However, I see you are now using a fork in which mergePoints is disabled, which seems to work for what you want to do. In that case you are probably affected by issue #51 as well.
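
For reference, the parse-time deduplication boils down to keeping a dict keyed by rounded coordinates, so a duplicate point never enters any list in the first place. An illustrative sketch with hypothetical names, not the actual ogr2pbf code:

class Node:
    def __init__(self, node_id, lon, lat):
        self.id, self.lon, self.lat = node_id, lon, lat

nodes_by_coord = {}  # (rounded lon, rounded lat) -> Node
next_id = [1]

def get_or_create_node(lon, lat, digits=7):
    # A dict lookup is O(1) on average, versus an O(n) list removal
    # per duplicate when merging after the fact.
    key = (round(lon, digits), round(lat, digits))
    node = nodes_by_coord.get(key)
    if node is None:
        node = Node(next_id[0], lon, lat)
        next_id[0] += 1
        nodes_by_coord[key] = node
    return node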

@gplv2
Author

gplv2 commented Oct 22, 2020

Hey, thanks a lot for looking into this and bringing #51 to my attention. It's been a while since I hacked on this, although the tool still exists and is in use. Really cool that you took the time for this.

ogr2pbf is one of the tools in the chain that prepares data for human-assisted import into OSM via JOSM.

https://staging.grbosm.site/#/ (zoom in far enough on the northern part of Belgium for the layer to get pulled from Postgres)

AFAIK, I solved it by just living with the duplicates; it got resolved later in the preprocessing chain, but I don't remember exactly how.

Anyway, pretty soon I'll be doing a fresh data-processing run, which is entirely automated. I will give it a go once it's backported, and replace my fork so it gets tested. The whole preprocessing of the data takes about 6 hours on a decent Google Cloud node.

Big thanks Roel.
