northamerica build and planet build result in different document schema for the same source_id #452

Closed
shekoufa opened this issue Feb 5, 2020 · 6 comments

Comments

@shekoufa

shekoufa commented Feb 5, 2020

Hey team!

For testing purposes, we built a north america version of Pelias to geocode US addresses only, and we got rrrreally rrrrreally good performance. We also have a planet build, and when we ran the same addresses through it, the performance was not good at all. Not even close to what we got from the north america build.
We were curious about what could cause this degradation in our planet build, so we decided to query the Elasticsearch index directly. We queried the same source_ids through Kibana on both Elasticsearch instances and noticed that the north america one has more fields in its schema for that document than the planet build. The fields missing from the planet build are:
parent.county, parent.county_a, parent.county_id, parent.locality, parent.locality_a, and parent.locality_id

Because these fields are missing from the planet index, an address that geocodes correctly in our north america build returns a less accurate result in our planet build (accurate only up to the city level).
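A minimal sketch of how the two documents could be compared field by field, assuming each one has already been fetched from its index as a plain dictionary. The helper names and the trimmed-down sample documents are hypothetical; only the field names come from the list above.

```python
def flatten_keys(doc, prefix=""):
    """Return the set of dotted field paths present in a nested dict."""
    paths = set()
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects such as "parent".
            paths |= flatten_keys(value, path)
        else:
            paths.add(path)
    return paths


def missing_fields(reference_doc, other_doc):
    """List field paths that exist in reference_doc but not in other_doc."""
    return sorted(flatten_keys(reference_doc) - flatten_keys(other_doc))


# Hypothetical, heavily trimmed stand-ins for the two indexed documents.
na_doc = {
    "parent": {
        "region": ["Virginia"],
        "county": ["Fairfax County"],
        "locality": ["Fairfax"],
    }
}
planet_doc = {"parent": {"region": ["Virginia"]}}

print(missing_fields(na_doc, planet_doc))
```

Running the same comparison over every document with a shared source_id would show whether the gap is limited to a few records or affects a whole area.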

I am wondering why the same build process would cause the schemas of the two builds to differ significantly. Oh, another thing we tried was querying the exact address against your API at geocode.earth, and quite interestingly it returned the same response that we got from our own planet build rather than an exact match.

For more clarity, I'm adding example addresses along with the JSON responses that I get from our north america build and the planet build:

Address: OF WASHINGTON DC 11901 BRADDOCK RD FAIRFAX,VA 22030

north america build's response:

{
    "bbox": [
        -77.357435,
        38.830104,
        -77.357435,
        38.830104
    ],
    "features": [
        {
            "geometry": {
                "coordinates": [
                    -77.357435,
                    38.830104
                ],
                "type": "Point"
            },
            "properties": {
                "accuracy": "point",
                "confidence": 0.8,
                "continent": "North America",
                "continent_gid": "whosonfirst:continent:102191575",
                "country": "United States",
                "country_a": "USA",
                "country_gid": "whosonfirst:country:85633793",
                "county": "Fairfax County",
                "county_a": "FX",
                "county_gid": "whosonfirst:county:102084863",
                "gid": "openaddresses:address:us/va/fairfax:86f19013011f31b2",
                "housenumber": "11901",
                "id": "us/va/fairfax:86f19013011f31b2",
                "label": "11901 Braddock Rd, Fairfax, VA, USA",
                "layer": "address",
                "locality": "Fairfax",
                "locality_gid": "whosonfirst:locality:101728653",
                "match_type": "fallback",
                "name": "11901 Braddock Rd",
                "postalcode": "22030",
                "region": "Virginia",
                "region_a": "VA",
                "region_gid": "whosonfirst:region:85688747",
                "source": "openaddresses",
                "source_id": "us/va/fairfax:86f19013011f31b2",
                "street": "Braddock Rd"
            },
            "type": "Feature"
        }
    ],
    "geocoding": {
        "attribution": "http://localhost:4000/attribution",
        "engine": {
            "author": "Mapzen",
            "name": "Pelias",
            "version": "1.0"
        },
        "query": {
            "lang": {
                "defaulted": true,
                "iso6391": "en",
                "iso6393": "eng",
                "name": "English"
            },
            "parsed_text": {
                "city": "fairfax",
                "number": "11901",
                "postalcode": "22030",
                "query": "of washington dc",
                "state": "va",
                "street": "braddock rd"
            },
            "parser": "libpostal",
            "private": false,
            "querySize": 20,
            "size": 10,
            "text": "OF WASHINGTON DC 11901 BRADDOCK RD FAIRFAX,VA 22030"
        },
        "timestamp": 1580878258343,
        "version": "0.2"
    },
    "type": "FeatureCollection"
}

planet build's response:

{
    "bbox": [
        -77.32554,
        38.80095,
        -77.30637,
        38.84622
    ],
    "features": [
        {
            "geometry": {
                "coordinates": [
                    -77.30637,
                    38.84622
                ],
                "type": "Point"
            },
            "properties": {
                "accuracy": "centroid",
                "confidence": 0.6,
                "continent": "North America",
                "continent_gid": "whosonfirst:continent:102191575",
                "country": "United States",
                "country_a": "USA",
                "country_gid": "whosonfirst:country:85633793",
                "gid": "geonames:locality:4758023",
                "id": "4758023",
                "label": "Fairfax, VA, USA",
                "layer": "locality",
                "locality": "Fairfax",
                "locality_gid": "geonames:locality:4758023",
                "match_type": "fallback",
                "name": "Fairfax",
                "region": "Virginia",
                "region_a": "VA",
                "region_gid": "whosonfirst:region:85688747",
                "source": "geonames",
                "source_id": "4758023"
            },
            "type": "Feature"
        },
        {
            "geometry": {
                "coordinates": [
                    -77.32554,
                    38.80095
                ],
                "type": "Point"
            },
            "properties": {
                "accuracy": "centroid",
                "confidence": 0.6,
                "continent": "North America",
                "continent_gid": "whosonfirst:continent:102191575",
                "country": "United States",
                "country_a": "USA",
                "country_gid": "whosonfirst:country:85633793",
                "gid": "geonames:locality:4758102",
                "id": "4758102",
                "label": "Fairfax Station, VA, USA",
                "layer": "locality",
                "locality": "Fairfax Station",
                "locality_gid": "geonames:locality:4758102",
                "match_type": "fallback",
                "name": "Fairfax Station",
                "region": "Virginia",
                "region_a": "VA",
                "region_gid": "whosonfirst:region:85688747",
                "source": "geonames",
                "source_id": "4758102"
            },
            "type": "Feature"
        }
    ],
    "geocoding": {
        "attribution": "https://localhost/attribution",
        "engine": {
            "author": "Mapzen",
            "name": "Pelias",
            "version": "1.0"
        },
        "query": {
            "lang": {
                "defaulted": true,
                "iso6391": "en",
                "iso6393": "eng",
                "name": "English"
            },
            "parsed_text": {
                "city": "fairfax",
                "number": "11901",
                "postalcode": "22030",
                "query": "of washington dc",
                "state": "va",
                "street": "braddock rd"
            },
            "parser": "libpostal",
            "private": false,
            "querySize": 20,
            "size": 10,
            "text": "OF WASHINGTON DC 11901 BRADDOCK RD FAIRFAX,VA 22030"
        },
        "timestamp": 1580878141779,
        "version": "0.2"
    },
    "type": "FeatureCollection"
}

But for this address: 4000 MERIDIAN BLVD STE 750, FRANKLIN TN 37067
our planet build has all those parent fields that were missing from the previous response. Here's the planet build's response for this address:

{
    "bbox": [
        -86.811226,
        35.951859,
        -86.811226,
        35.951859
    ],
    "features": [
        {
            "geometry": {
                "coordinates": [
                    -86.811226,
                    35.951859
                ],
                "type": "Point"
            },
            "properties": {
                "accuracy": "point",
                "continent": "North America",
                "continent_gid": "whosonfirst:continent:102191575",
                "country": "United States",
                "country_a": "USA",
                "country_gid": "whosonfirst:country:85633793",
                "gid": "openaddresses:address:us/tn/williamson:f22fde4f0a37d34b",
                "housenumber": "4000",
                "id": "us/tn/williamson:f22fde4f0a37d34b",
                "label": "4000 Meridian Blvd, Franklin, TN, USA",
                "layer": "address",
                "locality": "Franklin",
                "locality_gid": "whosonfirst:locality:101723093",
                "name": "4000 Meridian Blvd",
                "region": "Tennessee",
                "region_a": "TN",
                "region_gid": "whosonfirst:region:85688701",
                "source": "openaddresses",
                "source_id": "us/tn/williamson:f22fde4f0a37d34b",
                "street": "Meridian Blvd"
            },
            "type": "Feature"
        }
    ],
    "geocoding": {
        "attribution": "https://localhost/attribution",
        "engine": {
            "author": "Mapzen",
            "name": "Pelias",
            "version": "1.0"
        },
        "query": {
            "lang": {
                "defaulted": true,
                "iso6391": "en",
                "iso6393": "eng",
                "name": "English"
            },
            "parsed_text": {
                "admin": "STE 750, FRANKLIN TN",
                "housenumber": "4000",
                "postcode": "37067",
                "region": "STE",
                "street": "MERIDIAN BLVD",
                "subject": "4000 MERIDIAN BLVD"
            },
            "parser": "pelias",
            "private": false,
            "size": 10,
            "text": "4000 MERIDIAN BLVD STE 750, FRANKLIN TN 37067"
        },
        "timestamp": 1580878336140,
        "version": "0.2"
    },
    "type": "FeatureCollection"
}
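A small sketch for comparing responses like the ones above at a glance, assuming each response has been parsed into a dictionary. The function name is hypothetical, and the sample data keeps only four properties per feature, with values taken from the two Fairfax responses.

```python
def summarize_features(response):
    """Return (layer, source, accuracy, confidence) for each feature."""
    rows = []
    for feature in response.get("features", []):
        props = feature.get("properties", {})
        rows.append(
            (
                props.get("layer"),
                props.get("source"),
                props.get("accuracy"),
                props.get("confidence"),
            )
        )
    return rows


# Trimmed copies of the two Fairfax responses (most properties omitted).
na_response = {
    "features": [
        {"properties": {"layer": "address", "source": "openaddresses",
                        "accuracy": "point", "confidence": 0.8}},
    ]
}
planet_response = {
    "features": [
        {"properties": {"layer": "locality", "source": "geonames",
                        "accuracy": "centroid", "confidence": 0.6}},
        {"properties": {"layer": "locality", "source": "geonames",
                        "accuracy": "centroid", "confidence": 0.6}},
    ]
}

for label, response in (("north america", na_response), ("planet", planet_response)):
    print(label, summarize_features(response))
```

The summary makes the difference visible without reading the full JSON: the north america build answers at the address layer, while the planet build falls back to locality centroids.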
@missinglink
Member

That's interesting, the schema hasn't changed any time recently and the population of fields shouldn't change with the volume of data.

What I suspect is that you're importing different versions of WOF data between the builds and these differences are accounting for the changes.

It's also possible that the code has changed between builds but I looked and couldn't see anything which seemed related (we're working on something right now but it isn't merged yet).

Finally, it could be that your configurations are different, maybe you're running different versions of the docker containers or using different settings in pelias.json?

@missinglink
Member

When you say performance, are you referring to latency (CPU performance) or result quality?

@missinglink
Member

In the future, can you please paste your JSON blobs as pretty-printed JSON? We're volunteering our time and it's very difficult to read a massive blob of text.

@shekoufa
Author

shekoufa commented Feb 5, 2020

  • Thanks @missinglink for your answer. Now that I think about it, the versions are definitely different between the two builds, and the planet build is more recent. You also mentioned other possible causes. I'm going to carefully try each one and see whether it's to blame.

  • Sorry for the ambiguity but by performance, I meant result quality.

  • I am so sorry for pasting those JSONs like that. I actually pasted the pretty-printed JSON inside "``", but when I post it (or hit preview to see it before posting) the editor ignores all the newlines. I gotta google this and see if there's a way to keep the newlines there!

Update: Well, I learned something new about GitHub's markup :D All the JSONs are pretty-printed now. Sorry again for all the trouble you went through reading those scary lines of JSON. Never gonna happen again!

@shekoufa
Author

shekoufa commented Feb 5, 2020

@missinglink I investigated this issue further, thanks to your helpful reply, and I believe I might have found a bug. I will describe my test plan in full detail so that we can figure this out.
To make sure that the version of Pelias was not causing this problem, I created an EC2 instance and pulled the latest images directly from Pelias's Docker repo. Then I did the following:

1- Created two Amazon Elasticsearch Service (AES) instances and modified the pelias.json file inside planet and north-america to point to them (one for each). Let's call them AES-planet and AES-na.
2- Noticed that the whosonfirst configuration for the planet project looks like this:

"whosonfirst": {
    "datapath": "/data/whosonfirst",
    "importVenues": false,
    "importPostalcodes": true
}

and the same configuration for north america looks like this:

"whosonfirst": {
    "datapath": "/data/whosonfirst",
    "importPostalcodes": true,
    "importPlace": "102191575"
}

So far this makes sense: for north america we only download a portion of the whole WOF data, hence the importPlace setting. I also checked the codebase, wondering about importVenues, which is false for planet but not specified for north america, and found that when it is not specified, the default value is false.
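The comparison of the two whosonfirst blocks can be sketched as follows. This is only an illustration: the ASSUMED_DEFAULTS table is hypothetical and encodes just the importVenues default described above, plus "no importPlace" for a full planet import.

```python
# Assumption: defaults as described in this comment, not read from the importer.
ASSUMED_DEFAULTS = {"importVenues": False, "importPlace": None}


def effective(config):
    """Overlay an explicit config block on top of the assumed defaults."""
    return {**ASSUMED_DEFAULTS, **config}


def config_diff(a, b):
    """Return {key: (value_in_a, value_in_b)} for settings that really differ."""
    ea, eb = effective(a), effective(b)
    return {
        key: (ea.get(key), eb.get(key))
        for key in sorted(set(ea) | set(eb))
        if ea.get(key) != eb.get(key)
    }


planet_wof = {
    "datapath": "/data/whosonfirst",
    "importVenues": False,
    "importPostalcodes": True,
}
na_wof = {
    "datapath": "/data/whosonfirst",
    "importPostalcodes": True,
    "importPlace": "102191575",
}

print(config_diff(planet_wof, na_wof))
```

Under that assumption, importPlace is the only setting that effectively differs between the two projects.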

Now, things start to get interesting.
3- I downloaded WOF for planet using pelias download wof and then imported it into AES-planet. 3,572,815 documents were indexed. Then, to check the address mentioned in this issue (OF WASHINGTON DC 11901 BRADDOCK RD FAIRFAX,VA 22030), I opened the Kibana interface and filtered the data as follows:
source: whosonfirst, parent.region: virginia. I then searched for the word Fairfax and got 0 results!

4- Did the same steps for north america and got 1,371,851 documents indexed in AES-na. This time, I opened the Kibana interface for AES-na, added the same filters, searched for Fairfax, and boom: about 500 documents came back.
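For anyone who wants to repeat the check in steps 3 and 4 outside Kibana, here is a rough sketch of an equivalent Elasticsearch request body. It only builds and prints the body, so it can be pasted into any client. The function name is hypothetical, and searching name.default for the free-text term is an assumption about which field the Kibana search hit.

```python
import json


def build_wof_name_query(region, name_text, size=10):
    """Build a request body mirroring the Kibana filters used above."""
    return {
        "size": size,
        "query": {
            "bool": {
                # Filters taken from the steps above.
                "filter": [
                    {"term": {"source": "whosonfirst"}},
                    {"match": {"parent.region": region}},
                ],
                # Assumption: the free-text search is applied to name.default.
                "must": [
                    {"match": {"name.default": name_text}},
                ],
            }
        },
    }


body = build_wof_name_query("virginia", "Fairfax")
print(json.dumps(body, indent=2))
```

Sending the same body to both AES-planet and AES-na and comparing the hit totals would reproduce the 0 versus roughly 500 difference described above.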

Some differences I noticed in the downloaded WOF files for planet vs. north america:

  • The logs while downloading the data looked way different.
    For north america the logs looked like this:
[whosonfirst-sqlite-download]	 https://dist.whosonfirst.org/sqlite/whosonfirst-data-latest.db.bz2
[whosonfirst-sqlite-decompress]	 /data/whosonfirst/sqlite/whosonfirst-data-latest.db.bz2
[whosonfirst-sqlite-download]	 https://dist.whosonfirst.org/sqlite/whosonfirst-data-postalcode-vi-latest.db.bz2
[whosonfirst-sqlite-download]	 https://dist.whosonfirst.org/sqlite/whosonfirst-data-postalcode-pa-latest.db.bz2
[whosonfirst-sqlite-download]	 https://dist.whosonfirst.org/sqlite/whosonfirst-data-postalcode-sx-latest.db.bz2
[whosonfirst-sqlite-download]	 https://dist.whosonfirst.org/sqlite/whosonfirst-data-postalcode-bl-latest.db.bz2

while in the logs for the planet's attempt at downloading wof, I saw no reference to whosonfirst-sqlite. Here are the first few lines of the logs:

Downloading whosonfirst-data-ocean-latest.tar.bz2 bundle
Downloading whosonfirst-data-marinearea-latest.tar.bz2 bundle
Downloading whosonfirst-data-continent-latest.tar.bz2 bundle
Downloading whosonfirst-data-empire-latest.tar.bz2 bundle
done downloading whosonfirst-data-ocean-latest.tar.bz2 bundle
Downloading whosonfirst-data-country-latest.tar.bz2 bundle
done downloading whosonfirst-data-empire-latest.tar.bz2 bundle
Downloading whosonfirst-data-dependency-latest.tar.bz2 bundle
done downloading whosonfirst-data-marinearea-latest.tar.bz2 bundle
Downloading whosonfirst-data-disputed-latest.tar.bz2 bundle
done downloading whosonfirst-data-disputed-latest.tar.bz2 bundle
Downloading whosonfirst-data-macroregion-latest.tar.bz2 bundle
done downloading whosonfirst-data-dependency-latest.tar.bz2 bundle
Downloading whosonfirst-data-region-latest.tar.bz2 bundle
done downloading whosonfirst-data-continent-latest.tar.bz2 bundle
Downloading whosonfirst-data-macrocounty-latest.tar.bz2 bundle
done downloading whosonfirst-data-macroregion-latest.tar.bz2 bundle
Downloading whosonfirst-data-county-latest.tar.bz2 bundle
done downloading whosonfirst-data-macrocounty-latest.tar.bz2 bundle
Downloading whosonfirst-data-macrocounty-latest.tar.bz2 bundle
done downloading whosonfirst-data-macrocounty-latest.tar.bz2 bundle
Downloading whosonfirst-data-localadmin-latest.tar.bz2 bundle
done downloading whosonfirst-data-country-latest.tar.bz2 bundle
Downloading whosonfirst-data-locality-latest.tar.bz2 bundle
  • Another difference I noticed is that after the wof download finished for north-america, there was a sqlite folder inside /data/whosonfirst containing a bunch of DB files, but that folder did not exist when the wof download finished for the planet.
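The second difference can be checked with a few lines of Python. This is a sketch under assumptions: the function name is hypothetical, and the demo builds two throwaway directories instead of touching /data/whosonfirst, with whosonfirst-data-latest.db used as a stand-in file name based on the download log above.

```python
import tempfile
from pathlib import Path


def describe_wof_download(datapath):
    """Report whether <datapath>/sqlite exists and which .db files it holds."""
    sqlite_dir = Path(datapath) / "sqlite"
    has_dir = sqlite_dir.is_dir()
    db_files = sorted(p.name for p in sqlite_dir.glob("*.db")) if has_dir else []
    return {"has_sqlite_dir": has_dir, "db_files": db_files}


# Demo on temporary directories that imitate the two layouts described above.
with tempfile.TemporaryDirectory() as tmp:
    na_root = Path(tmp) / "north-america"
    (na_root / "sqlite").mkdir(parents=True)
    (na_root / "sqlite" / "whosonfirst-data-latest.db").touch()

    planet_root = Path(tmp) / "planet"
    planet_root.mkdir()

    na_report = describe_wof_download(na_root)
    planet_report = describe_wof_download(planet_root)

print(na_report)
print(planet_report)
```

Pointing describe_wof_download at the real datapath of each project after pelias download wof would show which download path each build took.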

I hope these details help figure out what is going on. It could simply be me forgetting a step for the planet build, or it could be an existing bug.

I also want to add that the performance (accuracy in geocoding addresses) of our north america build for the same 200 addresses is around 90%, which is amazing, but due to the problem mentioned here the planet build is at around 60%.
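One possible way to score such a batch, shown only as a sketch: count a response as a hit when its top feature is on the address layer. That scoring rule, the function name, and the tiny sample batch are all assumptions for illustration, not the method used to obtain the percentages above.

```python
def address_hit_rate(responses):
    """Fraction of responses whose first feature is on the 'address' layer."""
    if not responses:
        return 0.0
    hits = 0
    for response in responses:
        features = response.get("features") or []
        if features and features[0].get("properties", {}).get("layer") == "address":
            hits += 1
    return hits / len(responses)


# Hypothetical batch of four parsed responses (properties trimmed to "layer").
sample_batch = [
    {"features": [{"properties": {"layer": "address"}}]},
    {"features": [{"properties": {"layer": "locality"}}]},
    {"features": [{"properties": {"layer": "address"}}]},
    {"features": [{"properties": {"layer": "address"}}]},
]

print(f"{address_hit_rate(sample_batch):.0%}")
```

Scoring the same address list against both builds with one rule keeps the two percentages comparable.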

@orangejulius
Member

Hi folks,
Just came across this old issue. It was caused by corrupt data hosted by the old Who's on First data download service. Since pelias/whosonfirst#487 back in April, Geocode Earth has been building and hosting this data (and it is corruption free! :) ), and all importers should use that data.
