Postal code between city and province in Canadian addresses #250

Open
ghost opened this issue Sep 20, 2017 · 2 comments

@ghost

ghost commented Sep 20, 2017

If address is:
"6631 Island Highway North, Nanaimo, BC, V9T 4T7"
North is correctly recognized as street direction, and Nanaimo as city.
correct

But if the address was:
"6631 Island Highway North, Nanaimo, V9T 4T7, BC"
It returns null in street direction and city.
wrong

@albarrentine
Contributor

Not familiar with that format. Is that a common way to write an address in Canada or are you just trying to test different orderings of the input?

Libpostal trains on a variety of address formats/postcode placements but that's not a frequent one, so the parser has probably never seen the city->postcode->state transition with that city before. When building the training data, it's possible to swap the components around at random with certain small probabilities which allows the parser to handle non-standard formats, but generally we don't worry so much about one-off edge cases, only patterns that locals would be expected to use.
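
To illustrate the idea of swapping components with a small probability when generating training strings, here is a minimal sketch. It is not libpostal's actual training pipeline: the function name `format_training_example`, the `swap_prob` parameter, and the fixed component order are all assumptions made for this example.

```python
import random

def format_training_example(components, swap_prob=0.05, rng=random):
    """Render address components as one comma-separated string.

    With probability `swap_prob` (a hypothetical knob), emit the
    non-standard city -> postcode -> state order instead of the
    usual city -> state -> postcode order.
    """
    tail = ["city", "state", "postcode"]
    if rng.random() < swap_prob:
        tail = ["city", "postcode", "state"]

    # House number and road are joined by a space, not a comma.
    street = " ".join(
        components[k] for k in ("house_number", "road") if k in components
    )
    parts = [street] + [components[k] for k in tail if k in components]
    return ", ".join(p for p in parts if p)


example = {
    "house_number": "6631",
    "road": "Island Highway North",
    "city": "Nanaimo",
    "state": "BC",
    "postcode": "V9T 4T7",
}
standard = format_training_example(example, swap_prob=0.0)
swapped = format_training_example(example, swap_prob=1.0)
```

With `swap_prob=0.0` the standard order is always produced, and with `swap_prob=1.0` the postcode always precedes the province, so a per-country probability controls how often the parser sees the rarer pattern.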

In this specific case, the parser also has to contend with the fact that there's a neighborhood (from Quattroshapes) called "North Nanaimo" which sometimes gets added to the training data, so the phrase "North Nanaimo" looks like a suburb. The parser ignores commas in phrase search, though it always has the option to break up a phrase or treat it as another tag since e.g. "Nanaimo" doesn't always have to be a city, could be part of a POI name as in "Nanaimo Fire Station". Still, choosing when to break up a phrase correctly requires plenty of examples, and grepping through the training data I'm not seeing too many in Nanaimo where postcode is present.

I'd like to confirm that that format's a reasonably common thing for locals to write (again we're mostly concerned with the formats people use day-to-day, not every possible format they could use). If so, I'll attempt to create more examples of this pattern in the training data to fix it in the next release - no promises though: machine learning code is not written deterministically so one can only try to encourage the model to do the right thing, can't directly program it to, and as I said it's a relatively difficult case to train in a smaller city without too many examples. If the format's not common, I'd say just write it off as part of the 0.5% of addresses libpostal gets wrong. There has to be some tolerance for errors when using ML-based systems like this.

One other possible solution that can be implemented in code rather than data is requiring that known-phrase search respect comma boundaries in the input so it wouldn't even have to disambiguate the "North, Nanaimo" portion of the address from addresses with "North Nanaimo" the neighborhood in this case, though only when the comma structure is present. Have opted to ignore commas in other parts of the model so it doesn't come to rely on comma-delimited input, but using commas in phrase search could help performance across the board if commas are present and wouldn't hurt (though an edge case that comes to mind is sort names like "Korea, Republic of", so legitimate commas would have to be handled).
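
A rough sketch of what comma-aware phrase search could look like, using a toy two-entry dictionary. This is only an illustration of the idea, not libpostal's implementation; `KNOWN_PHRASES`, `tokenize`, and `find_phrases` are hypothetical names, and legitimate commas such as "Korea, Republic of" are not handled here.

```python
import re

# Toy gazetteer for illustration; a real phrase dictionary is far larger.
KNOWN_PHRASES = {
    ("north", "nanaimo"): "suburb",
    ("nanaimo",): "city",
}

def tokenize(text):
    """Split text into lowercase word tokens and standalone comma tokens."""
    return re.findall(r"[^\s,]+|,", text.lower())

def search_segment(words):
    """Greedy longest-match phrase search within one list of words."""
    max_len = max(len(p) for p in KNOWN_PHRASES)
    matches, i = [], 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            key = tuple(words[i:i + n])
            if key in KNOWN_PHRASES:
                matches.append((" ".join(key), KNOWN_PHRASES[key]))
                i += n
                break
        else:
            i += 1  # no phrase starts at this word
    return matches

def find_phrases(text, respect_commas):
    """Find known phrases, optionally refusing to match across commas."""
    tokens = tokenize(text)
    if respect_commas:
        # Search each comma-delimited segment independently.
        segments, current = [], []
        for tok in tokens:
            if tok == ",":
                segments.append(current)
                current = []
            else:
                current.append(tok)
        segments.append(current)
    else:
        # Drop commas entirely, so phrases may span them.
        segments = [[t for t in tokens if t != ","]]
    return [m for seg in segments for m in search_segment(seg)]


address = "6631 Island Highway North, Nanaimo, V9T 4T7, BC"
ignoring = find_phrases(address, respect_commas=False)
respecting = find_phrases(address, respect_commas=True)
```

When commas are ignored, "North, Nanaimo" matches the "north nanaimo" suburb entry; when segments are searched separately, only "nanaimo" matches, as a city.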

Also, are the above outputs coming directly from libpostal? Our model doesn't separate street type, direction, etc. Is this a custom model or is it doing some postprocessing on libpostal's output?

@albarrentine albarrentine changed the title from "When Postal Code is before Province, it can lead to failed street direction and" to "Postal code between city and province in Canadian addresses" Sep 20, 2017
@ghost
Author

ghost commented Sep 21, 2017

Hi @thatdatabaseguy, thanks for the quick but detailed answer

Yes we built street type and direction based on libpostal's results.
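
For readers wondering what such a layer might look like, here is a minimal sketch of splitting a trailing direction off the `road` value that libpostal returns. It is purely illustrative and not our production code; `DIRECTIONS` and `split_direction` are hypothetical names.

```python
# Map of direction tokens to a canonical form (illustrative, not exhaustive).
DIRECTIONS = {
    "n": "north", "north": "north",
    "s": "south", "south": "south",
    "e": "east", "east": "east",
    "w": "west", "west": "west",
    "ne": "northeast", "nw": "northwest",
    "se": "southeast", "sw": "southwest",
}

def split_direction(road):
    """Split a trailing direction token off a parsed road string.

    Returns (street_name, direction); direction is None when the last
    token is not a known direction.
    """
    tokens = road.lower().split()
    if len(tokens) > 1 and tokens[-1] in DIRECTIONS:
        return " ".join(tokens[:-1]), DIRECTIONS[tokens[-1]]
    return " ".join(tokens), None


with_dir = split_direction("island highway n")
without_dir = split_direction("island highway")
```

This also shows why the direction comes back as null in the failing case: once the parser moves "n" out of `road` and into `suburb`, there is no direction token left for a step like this to find.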

The two cases I mentioned came from actual business listings that we found online. Here is the raw libpostal output for each:

  1. 6631 Island Highway N,-, Nanaimo BC V9T 4T7, Canada
    https://foursquare.com/v/aw/56e1f394498ef010bf04a604
    Result: everything's correct
    {
    "house_number": "6631",
    "road": "island highway n",
    "city": "nanaimo",
    "state": "bc",
    "postcode": "v9t 4t7",
    "country": "canada"
    }

  2. 6631 Island Highway N- , Nanaimo V9T 4T7, BC
    https://www.hotfrog.ca/business/bc/nanaimo/a-w_6159874
    6631 Island Highway N, Nanaimo, V9T 4T7, BC, Canada
    http://www.tupalo.net/en/nanaimo-british-columbia/a-and-w-island-highway-north

    Result: N wrongly attributed to suburb with Nanaimo
    {
    "house_number": "6631",
    "road": "island highway",
    "suburb": "n nanaimo",
    "postcode": "v9t 4t7",
    "state": "bc"
    }

As for whether it's common to put the province after the postal code: yes, it happens. Here are some more examples:

====================
19705 Fraser Highway (Food Court) 
Langley, V3A 7E9
BC, Canada
http://www.tupalo.net/en/langley-british-columbia/a-and-w-fraser-highway

11900 Haney Pl(Food Court) , Maple Ridge V2X 8R9, BC 
https://www.hotfrog.ca/business/bc/maple-ridge/a-w_6161497

3211 Grant McConachie Level 3 , Richmond V7B 0A5 , BC Canada
https://www.hotfrog.ca/business/bc/richmond/a-w_6159889

6551 No. 3 Rd # 1202 , Richmond V6Y 2B6 , BC Canada
https://www.hotfrog.ca/business/bc/richmond/a-w-restaurant_4209733

Occasionally, people leave out the province altogether:

====================
215 Water Street
A1C 6C9 St. John's
https://www.opendi.ca/st-johns/847654.html

42 HIGH STREET GRAND FALLS-WINDSOR ,A2A2M4
http://www.brownbook.net/business/39168658/bdc---business-development-bank-of-canada

And this listing puts the postal code even before the city. I guess that's acceptable from a human's perspective too, since a postal code is so distinctive that we recognize it no matter where it appears in the address:

====================
42 High Street
A2A 1C6 Grand Falls-Windsor
https://www.opendi.ca/grand-falls-windsor/847649.html
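
To make the "recognizable anywhere" point concrete, here is a small sketch that finds a Canadian postal code by pattern alone, regardless of its position. It assumes the usual letter-digit-letter digit-letter-digit layout with the letters D, F, I, O, Q and U unused and W and Z not allowed in the first position; `CA_POSTCODE` and `find_postcode` are names invented for this example, and none of this is part of libpostal.

```python
import re

# Letter-digit-letter, optional space or hyphen, digit-letter-digit.
CA_POSTCODE = re.compile(
    r"\b[ABCEGHJ-NPRSTVXY]\d[ABCEGHJ-NPRSTV-Z][ -]?\d[ABCEGHJ-NPRSTV-Z]\d\b",
    re.IGNORECASE,
)

def find_postcode(address):
    """Return the first postal-code-like substring, or None."""
    match = CA_POSTCODE.search(address)
    return match.group(0) if match else None


samples = [
    "6631 Island Highway N- , Nanaimo V9T 4T7, BC",
    "215 Water Street A1C 6C9 St. John's",
    "42 HIGH STREET GRAND FALLS-WINDSOR ,A2A2M4",
    "215 Water Street",
]
found = [find_postcode(s) for s in samples]
```

A pattern like this could be used to locate the postal code before parsing, though it says nothing about how the remaining components should be labelled.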

My personal impression is that if the position of commas mattered, these cases might not have resulted in errors.

Thank you! Cheers.

albarrentine added a commit that referenced this issue Sep 23, 2017
…ges in all countries, and a much higher chance of generating city->postcode->state in Canada for #250, French and English territories inherit address formats
Jeffrey04 pushed a commit to Jeffrey04/libpostal that referenced this issue Nov 15, 2017