[osmium-merge] With 2 input files merge does not clean duplicates #41

Ignishky · 2017-02-08T11:18:50Z

When I try to merge 2 pbf files containing duplicates, the output still has the duplicates.
merge file1.osm.pbf file2.osm.pbf -o result.osm.pbf

But if I pass one of the files a second time, the output no longer contains duplicates.
merge file1.osm.pbf file2.osm.pbf file1.osm.pbf -o result.osm.pbf

The issue seems to come from the special case for exactly 2 input files, which appears to do nothing more than a union of the two files.

} else if (m_input_files.size() == 2) {
        // Use simpler code when there are exactly two files to merge
        m_vout << "Merging 2 input files to output file...\n";
        osmium::io::Reader reader1(m_input_files[0], osmium::osm_entity_bits::object);
        osmium::io::Reader reader2(m_input_files[1], osmium::osm_entity_bits::object);
        auto in1 = osmium::io::make_input_iterator_range<osmium::OSMObject>(reader1);
        auto in2 = osmium::io::make_input_iterator_range<osmium::OSMObject>(reader2);
        auto out = osmium::io::make_output_iterator(writer);

        std::set_union(in1.cbegin(), in1.cend(),
                       in2.cbegin(), in2.cend(),
                       out);
    }
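For readers unfamiliar with std::set_union, here is a minimal sketch of its behaviour on plain integers rather than OSM objects. The helper name union_of is hypothetical and only used for illustration. An element present in both sorted inputs is written once, but an element repeated inside a single input is kept as many times as it appears there.

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Hypothetical helper: union of two sorted ranges via std::set_union.
// Elements that are equivalent under operator< across the two inputs are
// emitted once (taken from the first range); repeats within one input stay.
std::vector<int> union_of(const std::vector<int>& a, const std::vector<int>& b) {
    std::vector<int> out;
    std::set_union(a.cbegin(), a.cend(),
                   b.cbegin(), b.cend(),
                   std::back_inserter(out));
    return out;
}
```

For example, union_of({1, 2, 3}, {2, 3, 4}) gives 1 2 3 4, while union_of({1, 2, 2}, {2, 3}) gives 1 2 2 3.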


joto · 2017-02-08T13:07:00Z

Strictly speaking you are using osmium merge outside the spec if you have duplicates in any of the input files. Osmium only promises to remove duplicates if they appear in different files, not if the duplicates are in the same file.

I never thought about the case you have. It is kind of difficult to understand the difference and explain this to users. And it is easy to fix by removing the special case you mentioned. So I'll fix it.

joto · 2017-02-08T13:34:50Z

Seems I was overhasty. This is not so easy to fix. You say it doesn't happen in the three-file case, but that isn't so. Maybe it does in your case, but generally that is not true.

Why do you have files with duplicate entries in the first place?

Ignishky · 2017-02-08T14:42:57Z

I merge generated files into a single one. It's not data from the OSM site but data in an OSM-compliant format.

The files contain administrative boundaries of different levels.

There is no duplicate data within a single file, but the 2 files contain the same nodes and even the same ways, islands for example.

In those cases, the output contains all the nodes and ways from both input files.

(I use osmconvert to be able to grep inside the pbf files)

extract from input file 1 :

grep 'id="22659884631506728' France/pbfFiles/fra.osm                                     
	<way id="22659884631506728" version="1" timestamp="2017-02-08T09:40:38Z" changeset="1" uid="1" user="Toto">

extract from input file 2 :

grep 'id="22659884631506728' France/pbfFiles/fraf22.osm
	<way id="22659884631506728" version="1" timestamp="2017-02-08T09:41:39Z" changeset="1" uid="1" user="Toto">

extract from output file :

grep 'id="22659884631506728' France/France2.osm
	<way id="22659884631506728" version="1" timestamp="2017-02-08T09:40:38Z" changeset="1" uid="1" user="Toto">
	<way id="22659884631506728" version="1" timestamp="2017-02-08T09:41:39Z" changeset="1" uid="1" user="Toto">

but if I pass the fra.osm.pbf file a second time, then the output is :

grep 'id="22659884631506728' France/France3.osm         
	<way id="22659884631506728" version="1" timestamp="2017-02-08T09:41:39Z" changeset="1" uid="1" user="Toto">

My thought: as the timestamps are not the same, the set_union used in the 2-file case does not see the 2 ways as identical, whereas in the 3-file case no set_union is made.

I'm not a C++ dev but a Java one, so I might have missed something :)

joto · 2017-02-08T15:12:56Z

If you have two different variants of the same object with the same id and version, then all bets are off. That's not correct data and I can't guarantee any outcome. I'll clarify this in the man page.

The reason you are seeing the different behaviour is probably that one algorithm uses == comparison and the other <. But this is not something you can rely upon.
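To make the == versus < distinction concrete, here is a small sketch. It is not the osmium code: the Obj struct, the two comparison functions, and the assumption that the ordering looks at the timestamp while equality looks only at type, id, and version are all hypothetical, chosen to reproduce the kind of difference reported above.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <tuple>
#include <vector>

// Hypothetical stand-in for an OSM object; not the libosmium type.
struct Obj {
    int type;
    std::int64_t id;
    int version;
    std::int64_t timestamp;  // illustrative value, e.g. seconds
};

// Assumed ordering that also takes the timestamp into account.
bool less_with_timestamp(const Obj& a, const Obj& b) {
    return std::tie(a.type, a.id, a.version, a.timestamp) <
           std::tie(b.type, b.id, b.version, b.timestamp);
}

// Assumed equality on type, id, and version only.
bool same_object(const Obj& a, const Obj& b) {
    return std::tie(a.type, a.id, a.version) ==
           std::tie(b.type, b.id, b.version);
}

// Two-way merge that relies on the ordering alone.
std::vector<Obj> union_by_less(const std::vector<Obj>& a, const std::vector<Obj>& b) {
    std::vector<Obj> out;
    std::set_union(a.cbegin(), a.cend(), b.cbegin(), b.cend(),
                   std::back_inserter(out), less_with_timestamp);
    return out;
}

// Merge in order, then drop neighbours that compare equal.
std::vector<Obj> merge_then_dedupe(const std::vector<Obj>& a, const std::vector<Obj>& b) {
    std::vector<Obj> merged;
    std::merge(a.cbegin(), a.cend(), b.cbegin(), b.cend(),
               std::back_inserter(merged), less_with_timestamp);
    std::vector<Obj> out;
    std::unique_copy(merged.cbegin(), merged.cend(),
                     std::back_inserter(out), same_object);
    return out;
}
```

Given two ways with the same id and version but different timestamps, union_by_less keeps both, because neither is "not less than" the other in both directions, while merge_then_dedupe keeps only one. Which variant survives depends on the details of the algorithm, which is why such input gives no guaranteed outcome.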

joto added a commit that referenced this issue Feb 8, 2017

Extend osmium-merge man page.

8f7a649

Mention in man page that object comparison is only done on type, id, and version. See #41.

joto closed this as completed Feb 15, 2017

joto mentioned this issue May 9, 2017

merge: different behavior of two-way and n-way merge #57

Closed
