[osmium-merge] With 2 input files merge does not clean duplicates #41

Closed
Ignishky opened this issue Feb 8, 2017 · 4 comments

Ignishky commented Feb 8, 2017

When I try to merge 2 pbf files containing duplicates, the output still has the duplicates.
osmium merge file1.osm.pbf file2.osm.pbf -o result.osm.pbf

But if I pass one of the files a second time, the output no longer contains duplicates.
osmium merge file1.osm.pbf file2.osm.pbf file1.osm.pbf -o result.osm.pbf

The issue seems to come from the special case for exactly 2 input files, which only performs a union of the two files.

} else if (m_input_files.size() == 2) {
    // Use simpler code when there are exactly two files to merge
    m_vout << "Merging 2 input files to output file...\n";
    osmium::io::Reader reader1(m_input_files[0], osmium::osm_entity_bits::object);
    osmium::io::Reader reader2(m_input_files[1], osmium::osm_entity_bits::object);
    auto in1 = osmium::io::make_input_iterator_range<osmium::OSMObject>(reader1);
    auto in2 = osmium::io::make_input_iterator_range<osmium::OSMObject>(reader2);
    auto out = osmium::io::make_output_iterator(writer);

    std::set_union(in1.cbegin(), in1.cend(),
                   in2.cbegin(), in2.cend(),
                   out);
}

joto (Member) commented Feb 8, 2017

Strictly speaking you are using osmium merge outside the spec if you have duplicates in any of the input files. Osmium only promises to remove duplicates if they appear in different files, not if the duplicates are in the same file.

I never thought about the case you have. It is kind of difficult to understand the difference and explain this to users. And it is easy to fix by removing the special case you mentioned. So I'll fix it.
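
The distinction between duplicates across files and duplicates within one file matches how `std::set_union` treats repeated elements: a value that appears m times in the first range and n times in the second appears max(m, n) times in the output. The following is a minimal sketch using plain integer ids instead of OSM objects; `merge_two` is a hypothetical helper, not part of osmium.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

// Hypothetical helper (not osmium code): merge two sorted id lists
// with std::set_union, as the two-file branch quoted above does.
std::vector<std::int64_t> merge_two(const std::vector<std::int64_t>& a,
                                    const std::vector<std::int64_t>& b) {
    std::vector<std::int64_t> out;
    std::set_union(a.cbegin(), a.cend(),
                   b.cbegin(), b.cend(),
                   std::back_inserter(out));
    return out;
}
```

With this helper, `merge_two({1, 2, 3}, {2, 3, 4})` yields `1 2 3 4`, since ids shared between the two inputs are written once. `merge_two({1, 2, 2}, {2, 3})` yields `1 2 2 3`, because the duplicate inside the first input survives.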

joto (Member) commented Feb 8, 2017

Seems I was overhasty. This is not so easy to fix. You say it doesn't happen in the three-file case, but that isn't so. It may work out that way for your files, but it is not true in general.

Why do you have files with duplicate entries in the first place?

Ignishky (Author) commented Feb 8, 2017

I merge generated files into a single one. It's not data from the OSM site but data in an OSM-compliant format.

The files contain administrative boundaries of different levels.

There is no duplicate data within a single file, but the 2 files contain the same nodes and even the same ways, islands for example.

In those cases, the output contains all the nodes and ways from both input files.

(I use osmconvert to be able to grep inside the pbf files)

extract from input file 1 :

grep 'id="22659884631506728' France/pbfFiles/fra.osm
	<way id="22659884631506728" version="1" timestamp="2017-02-08T09:40:38Z" changeset="1" uid="1" user="Toto">

extract from input file 2 :

grep 'id="22659884631506728' France/pbfFiles/fraf22.osm
	<way id="22659884631506728" version="1" timestamp="2017-02-08T09:41:39Z" changeset="1" uid="1" user="Toto">

extract from output file :

grep 'id="22659884631506728' France/France2.osm
	<way id="22659884631506728" version="1" timestamp="2017-02-08T09:40:38Z" changeset="1" uid="1" user="Toto">
	<way id="22659884631506728" version="1" timestamp="2017-02-08T09:41:39Z" changeset="1" uid="1" user="Toto">

but if I use the fra.osm.pbf file a second time, then the output is :

grep 'id="22659884631506728' France/France3.osm
	<way id="22659884631506728" version="1" timestamp="2017-02-08T09:41:39Z" changeset="1" uid="1" user="Toto">

My thought: as the timestamps are not the same, the set_union used in the 2-file case does not see the 2 ways as identical, whereas in the 3-file case no set_union is performed.

I'm not a C++ dev but a Java one, so I might have missed something :)

joto (Member) commented Feb 8, 2017

If you have two different variants of the same object with the same id and version, then all bets are off. That's not correct data and I can't guarantee any outcome. I'll clarify this in the man page.

The reason you are seeing the different behaviour is probably that one algorithm uses == comparison and the other <. But this is not something you can rely upon.
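
To illustrate how an `==` versus `<` mismatch could produce the behaviour reported above, here is a minimal sketch. It assumes a hypothetical `Obj` type whose equality looks only at id and version while its ordering also looks at the timestamp. This is not the libosmium class, and which comparison each osmium code path actually uses is not confirmed in this thread.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <tuple>
#include <vector>

// Hypothetical stand-in for an OSM object; not the libosmium class.
struct Obj {
    std::int64_t id;
    int version;
    std::int64_t timestamp;  // illustrative value only
};

// Assumption: equality ignores the timestamp ...
bool operator==(const Obj& a, const Obj& b) {
    return std::tie(a.id, a.version) == std::tie(b.id, b.version);
}

// ... while the ordering also takes the timestamp into account.
bool operator<(const Obj& a, const Obj& b) {
    return std::tie(a.id, a.version, a.timestamp) <
           std::tie(b.id, b.version, b.timestamp);
}

// '<'-based path: std::set_union treats two elements as the same
// only if neither is less than the other.
std::size_t count_with_set_union(const std::vector<Obj>& a,
                                 const std::vector<Obj>& b) {
    std::vector<Obj> out;
    std::set_union(a.cbegin(), a.cend(), b.cbegin(), b.cend(),
                   std::back_inserter(out));
    return out.size();
}

// '=='-based path: merge everything, then drop adjacent elements
// that compare equal with operator==.
std::size_t count_with_unique(const std::vector<Obj>& a,
                              const std::vector<Obj>& b) {
    std::vector<Obj> out;
    std::merge(a.cbegin(), a.cend(), b.cbegin(), b.cend(),
               std::back_inserter(out));
    out.erase(std::unique(out.begin(), out.end()), out.end());
    return out.size();
}
```

For two single-element inputs that share id and version but differ in timestamp, `count_with_set_union` reports 2 objects and `count_with_unique` reports 1, which mirrors the two-file and three-file outputs shown earlier. As noted above, neither result is something to rely on when the input data is inconsistent.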

joto added a commit that referenced this issue Feb 8, 2017:
Mention in man page that object comparison is only done on type, id, and version.

See #41.

joto closed this as completed Feb 15, 2017