expand germanic street abbreviations #68

Merged
merged 7 commits into from Aug 19, 2016

Conversation

Projects
None yet
4 participants
@missinglink
Member

missinglink commented Mar 16, 2016

// expand '-str.' to '-strasse'
// note: '-straße' is only used in Germany, other countries like
// switzerland use 'strasse'.

eg. 'Lindenstr' -> 'Lindenstrasse'

closes pelias/pelias#279

@missinglink missinglink self-assigned this Mar 16, 2016

@missinglink

This comment has been minimized.

Show comment
Hide comment
@missinglink

missinglink Mar 16, 2016

Member
GET /pelias/_search?search_type=count
{
   "aggs": {
      "field_aggs": {
         "terms": {
            "field": "parent.country_a"
         },
         "aggs": {
            "germanic_ending": {
               "filter": {
                  "regexp": {
                     "address.street": "[^ ]+(str)"
                  }
               }
            }
         }
      }
   }
}
{
   "took": 10014,
   "timed_out": false,
   "_shards": {
      "total": 12,
      "successful": 12,
      "failed": 0
   },
   "hits": {
      "total": 279912234,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "field_aggs": {
         "doc_count_error_upper_bound": 39547,
         "sum_other_doc_count": 57718434,
         "buckets": [
            {
               "key": "fra",
               "doc_count": 25905674,
               "germanic_ending": {
                  "doc_count": 5
               }
            },
            {
               "key": "nld",
               "doc_count": 17577224,
               "germanic_ending": {
                  "doc_count": 97
               }
            },
            {
               "key": "deu",
               "doc_count": 14241847,
               "germanic_ending": {
                  "doc_count": 58369
               }
            },
            {
               "key": "cze",
               "doc_count": 5972478,
               "germanic_ending": {
                  "doc_count": 1
               }
            }
         ]
      }
   }
}

The French ones are:

  • 5172 kermestr
  • 5248 kermestr
  • 289 kermestr
  • 82 kerglastr
  • 5025 kerglastr

The Czech one is (I think this is actually DEU but incorrectly attributed as CZE, it's ~1km from the border)

  • 2 Jägershoferstr

The Dutch actually expand differently:

  • Borgercompagniesterstr -> Borgercompagniesterstraat
Member

missinglink commented Mar 16, 2016

GET /pelias/_search?search_type=count
{
   "aggs": {
      "field_aggs": {
         "terms": {
            "field": "parent.country_a"
         },
         "aggs": {
            "germanic_ending": {
               "filter": {
                  "regexp": {
                     "address.street": "[^ ]+(str)"
                  }
               }
            }
         }
      }
   }
}
{
   "took": 10014,
   "timed_out": false,
   "_shards": {
      "total": 12,
      "successful": 12,
      "failed": 0
   },
   "hits": {
      "total": 279912234,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "field_aggs": {
         "doc_count_error_upper_bound": 39547,
         "sum_other_doc_count": 57718434,
         "buckets": [
            {
               "key": "fra",
               "doc_count": 25905674,
               "germanic_ending": {
                  "doc_count": 5
               }
            },
            {
               "key": "nld",
               "doc_count": 17577224,
               "germanic_ending": {
                  "doc_count": 97
               }
            },
            {
               "key": "deu",
               "doc_count": 14241847,
               "germanic_ending": {
                  "doc_count": 58369
               }
            },
            {
               "key": "cze",
               "doc_count": 5972478,
               "germanic_ending": {
                  "doc_count": 1
               }
            }
         ]
      }
   }
}

The French ones are:

  • 5172 kermestr
  • 5248 kermestr
  • 289 kermestr
  • 82 kerglastr
  • 5025 kerglastr

The Czech one is (I think this is actually DEU but incorrectly attributed as CZE, it's ~1km from the border)

  • 2 Jägershoferstr

The Dutch actually expand differently:

  • Borgercompagniesterstr -> Borgercompagniesterstraat
@trescube

This comment has been minimized.

Show comment
Hide comment
@trescube

trescube Mar 16, 2016

Contributor

While I have no objections to corrections to data that the OA team is wary of taking responsibility for, I would like to see such actions take place in a more targeted way. That is, currently data cleanup is happened as soon as the record is read before additional context of admin-lookup is added. Since this change is specifically for German street names, it should probably happen after admin-lookup so that modifications can benefit from knowing the country.

Contributor

trescube commented Mar 16, 2016

While I have no objections to corrections to data that the OA team is wary of taking responsibility for, I would like to see such actions take place in a more targeted way. That is, currently data cleanup is happened as soon as the record is read before additional context of admin-lookup is added. Since this change is specifically for German street names, it should probably happen after admin-lookup so that modifications can benefit from knowing the country.

@missinglink

This comment has been minimized.

Show comment
Hide comment
@missinglink

missinglink Mar 16, 2016

Member

right @trescube agree, I'll move the stream after the admin lookup and only target German addresses.

Member

missinglink commented Mar 16, 2016

right @trescube agree, I'll move the stream after the admin lookup and only target German addresses.

@orangejulius

This comment has been minimized.

Show comment
Hide comment
@orangejulius

orangejulius Mar 16, 2016

Member

This looks like a really good change to have.

We should be careful though, since it will be running across LOTS of data.

First, I like stephen's suggestion of limiting it to just a few countries. The aggregation above has a line "sum_other_doc_count": 57718434, which says there are 57M matching documents not in those 4 countries listed.

Second, we have to make sure that even within those countries, it doesn't do something unexpected. When I initially added the street name cleanup method, I hacked my importer to print out a line each time it made a change, with the old value, and the new one, and then skip the rest of the import pipeline. I ran the resulting output file through sort | uniq -c to see how many unique changes there were (there were only a couple hundred), and looked at each one to make sure it was ok. Let's do the same thing here.

Member

orangejulius commented Mar 16, 2016

This looks like a really good change to have.

We should be careful though, since it will be running across LOTS of data.

First, I like stephen's suggestion of limiting it to just a few countries. The aggregation above has a line "sum_other_doc_count": 57718434, which says there are 57M matching documents not in those 4 countries listed.

Second, we have to make sure that even within those countries, it doesn't do something unexpected. When I initially added the street name cleanup method, I hacked my importer to print out a line each time it made a change, with the old value, and the new one, and then skip the rest of the import pipeline. I ran the resulting output file through sort | uniq -c to see how many unique changes there were (there were only a couple hundred), and looked at each one to make sure it was ok. Let's do the same thing here.

@missinglink missinglink referenced this pull request in pelias/schema Mar 17, 2016

Closed

Refactor autocomplete analysis #109

@orangejulius orangejulius added in progress and removed in review labels Apr 21, 2016

@missinglink missinglink referenced this pull request in pelias/schema Apr 29, 2016

Merged

autocomplete milestone #127

@orangejulius

This comment has been minimized.

Show comment
Hide comment
@orangejulius

orangejulius Jul 7, 2016

Member

I had some time to fool around with this, and made some quick modifications to the import pipeline so that it prints out these changes so we can look at them. There were only 2277 unique changes made, they're attached so we can look at them, but we probably also want to look at what country the changes were made in, since I suspect some of them are acceptable changes in, say, Germany, but not the UK.
changes.txt

Member

orangejulius commented Jul 7, 2016

I had some time to fool around with this, and made some quick modifications to the import pipeline so that it prints out these changes so we can look at them. There were only 2277 unique changes made, they're attached so we can look at them, but we probably also want to look at what country the changes were made in, since I suspect some of them are acceptable changes in, say, Germany, but not the UK.
changes.txt

@orangejulius

This comment has been minimized.

Show comment
Hide comment
Member

orangejulius commented Jul 7, 2016

@trescube

This comment has been minimized.

Show comment
Hide comment
@trescube

trescube Jul 7, 2016

Contributor

👍

Contributor

trescube commented Jul 7, 2016

👍

lib/cleanup.js
+// expand '-str.' to '-strasse'
+// note: '-straße' is only used in Germany, other countries like
+// switzerland use 'strasse'.
+function expandGermanicStreetSuffixes(token) {

This comment has been minimized.

@missinglink

missinglink Aug 3, 2016

Member

is this functionality still required? it seems to be a duplicate of the functions in lib/streams/germanAbbreviationStream.js

@missinglink

missinglink Aug 3, 2016

Member

is this functionality still required? it seems to be a duplicate of the functions in lib/streams/germanAbbreviationStream.js

lib/streams/germanAbbreviationStream.js
+ return record.address_parts.street.replace(/([^\s]+)(s|S)tr\.?$/i,'$1$2træde');
+ }
+ if (_.isEqual(record.parent.country_a, ['MDA'])){
+ return record.address_parts.street.replace(/^([\s]*)(s|S)tr\.?\s/i,'$2trada ');

This comment has been minimized.

@missinglink

missinglink Aug 3, 2016

Member

is the trailing space intentional? '$2trada '

[edit] oh I just realized this is matching anywhere in the text rather than just the last token.

It might be better to use ^ to anchor the search at the start of the string, I worry about what other things this might replace such as other Romanian words which may legitimately end with '-str'.

It would probably be best to target whole tokens that are str or str., let's discuss at standup

@missinglink

missinglink Aug 3, 2016

Member

is the trailing space intentional? '$2trada '

[edit] oh I just realized this is matching anywhere in the text rather than just the last token.

It might be better to use ^ to anchor the search at the start of the string, I worry about what other things this might replace such as other Romanian words which may legitimately end with '-str'.

It would probably be best to target whole tokens that are str or str., let's discuss at standup

lib/streams/germanAbbreviationStream.js
+ var temp = record.address_parts.street.replace(/([^\s]+)(s|S)tr\.?$/i,'$1$2trasse');
+ return temp;
+ }
+ if (_.isEqual(record.parent.country_a, ['NED'])){

This comment has been minimized.

@missinglink

missinglink Aug 3, 2016

Member

we use ISO 3166-1 "alpha 3 codes" so they need to be changed to:

NED -> NLD
DMK -> DNK
@missinglink

missinglink Aug 3, 2016

Member

we use ISO 3166-1 "alpha 3 codes" so they need to be changed to:

NED -> NLD
DMK -> DNK
lib/streams/germanAbbreviationStream.js
+ });
+}
+
+module.exports.create = createGermanAbbStream;

This comment has been minimized.

@missinglink

missinglink Aug 3, 2016

Member

this code could be refactored for readability such as:

var through = require('through2');

// match strings ending in one of: ['str.', 'Str.', 'str', 'Str']
var REGEX_MATCH_STREET_ABBR_SUFFIX = /([^\s]+)(s|S)tr\.?$/i;

// match strings starting with one of: ['str. ', 'Str. ', 'str ', 'Str ']
var REGEX_MATCH_STREET_ABBR_PREFIX = /^([\s]*)(s|S)tr\.?\s/i;

// expand '-str.' to '-strasse'
// note: '-straße' is only used in Germany, other countries like
// switzerland use 'strasse', '-strasse' is also acceptable in Germany.
function expandStrasse(token) {
  return token.replace(REGEX_MATCH_STREET_ABBR_SUFFIX,'$1strasse');
}

// expand '-str.' to '-straat'
function expandStraat(token) {
  return token.replace(REGEX_MATCH_STREET_ABBR_SUFFIX,'$1straat');
}

// expand '-str.' to '-stræde'
function expandStræde(token) {
  return token.replace(REGEX_MATCH_STREET_ABBR_SUFFIX,'$1stræde');
}

// expand '-str.' to '-strada'
function expandStrada(token) {
  return token.replace(REGEX_MATCH_STREET_ABBR_PREFIX,'$1strada ');
}

function expandGermanicStreetSuffixes(record) {

  // country code
  var cc = record.parent.country_a[0];

  // Germany, Switzerland and Austria
  if( 'DEU' === cc || 'CHE' === cc || 'AUT' === cc ){
    return expandStrasse( record.address_parts.street );
  }

  // The Netherlands
  if( 'NLD' === cc ){
    return expandStraat( record.address_parts.street );
  }

  // Denmark
  if( 'DNK' === cc ){
    return expandStræde( record.address_parts.street );
  }

  // Moldova
  if( 'MDA' === cc ){
    return expandStrada( record.address_parts.street );
  }

  // Other countries
  return record.address_parts.street;
}

function createGermanAbbStream(){
  return through.obj(function(record, enc, next){
    record.address_parts.street = expandGermanicStreetSuffixes(record);

    next(null, record);
  });
}

module.exports.create = createGermanAbbStream;
@missinglink

missinglink Aug 3, 2016

Member

this code could be refactored for readability such as:

var through = require('through2');

// match strings ending in one of: ['str.', 'Str.', 'str', 'Str']
var REGEX_MATCH_STREET_ABBR_SUFFIX = /([^\s]+)(s|S)tr\.?$/i;

// match strings starting with one of: ['str. ', 'Str. ', 'str ', 'Str ']
var REGEX_MATCH_STREET_ABBR_PREFIX = /^([\s]*)(s|S)tr\.?\s/i;

// expand '-str.' to '-strasse'
// note: '-straße' is only used in Germany, other countries like
// switzerland use 'strasse', '-strasse' is also acceptable in Germany.
function expandStrasse(token) {
  return token.replace(REGEX_MATCH_STREET_ABBR_SUFFIX,'$1strasse');
}

// expand '-str.' to '-straat'
function expandStraat(token) {
  return token.replace(REGEX_MATCH_STREET_ABBR_SUFFIX,'$1straat');
}

// expand '-str.' to '-stræde'
function expandStræde(token) {
  return token.replace(REGEX_MATCH_STREET_ABBR_SUFFIX,'$1stræde');
}

// expand '-str.' to '-strada'
function expandStrada(token) {
  return token.replace(REGEX_MATCH_STREET_ABBR_PREFIX,'$1strada ');
}

function expandGermanicStreetSuffixes(record) {

  // country code
  var cc = record.parent.country_a[0];

  // Germany, Switzerland and Austria
  if( 'DEU' === cc || 'CHE' === cc || 'AUT' === cc ){
    return expandStrasse( record.address_parts.street );
  }

  // The Netherlands
  if( 'NLD' === cc ){
    return expandStraat( record.address_parts.street );
  }

  // Denmark
  if( 'DNK' === cc ){
    return expandStræde( record.address_parts.street );
  }

  // Moldova
  if( 'MDA' === cc ){
    return expandStrada( record.address_parts.street );
  }

  // Other countries
  return record.address_parts.street;
}

function createGermanAbbStream(){
  return through.obj(function(record, enc, next){
    record.address_parts.street = expandGermanicStreetSuffixes(record);

    next(null, record);
  });
}

module.exports.create = createGermanAbbStream;
lib/streams/germanAbbreviationStream.js
+ return record.address_parts.street.replace(/([^\s]+)(s|S)tr\.?$/i,'$1$2traat');
+ }
+ if (_.isEqual(record.parent.country_a, ['DMK'])){
+ return record.address_parts.street.replace(/([^\s]+)(s|S)tr\.?$/i,'$1$2træde');

This comment has been minimized.

@missinglink

missinglink Aug 3, 2016

Member

Its probably best to discard the case of the S here and just return foostrasse instead of fooStrasse

@missinglink

missinglink Aug 3, 2016

Member

Its probably best to discard the case of the S here and just return foostrasse instead of fooStrasse

@avulfson17

This comment has been minimized.

Show comment
Hide comment
@avulfson17

avulfson17 Aug 3, 2016

Contributor

So, to be clear, we dont want something like Foo Str. to be expanded to Foo Strasse? It seems the only time that you'd get fooStrasse is with the initial string fooStr(.)

Is fooStr a possible way of writing this? It would seem that you'd either add a space or keep the s lower case. If it really is used then changing it is no big deal.

Point number 2. The way the regex for moldova should work is that its anchored at the beginning (with ^) and matches 0 or more spaces after that followed by either S or s followed by t and r and maybe a . and ending with a space. So this regex should only catch str, str., Str., and Str at the beginning of the street name. If this gets used then the space is necessary because the code checks for a space in order to match it

Contributor

avulfson17 commented Aug 3, 2016

So, to be clear, we dont want something like Foo Str. to be expanded to Foo Strasse? It seems the only time that you'd get fooStrasse is with the initial string fooStr(.)

Is fooStr a possible way of writing this? It would seem that you'd either add a space or keep the s lower case. If it really is used then changing it is no big deal.

Point number 2. The way the regex for moldova should work is that its anchored at the beginning (with ^) and matches 0 or more spaces after that followed by either S or s followed by t and r and maybe a . and ending with a space. So this regex should only catch str, str., Str., and Str at the beginning of the street name. If this gets used then the space is necessary because the code checks for a space in order to match it

@missinglink

This comment has been minimized.

Show comment
Hide comment
@missinglink

missinglink Aug 3, 2016

Member

Yes, you are correct in saying we would like Foo Str. to be Foo Strasse, I hadn't thought about that case, and in that case it makes sense to capitalize the street token.

I was referring to this situation, which should return foostrasse (when it's a compound word)

"fooStr".replace(/([^\s]+)(s|S)tr\.?$/i,'$1$2trasse')
"fooStrasse"

It's best to keep the words compound or separate depending on the source, so:

foo str. -> foo strasse
foostr. -> foostrasse

I think I didn't read the Moldava regex properly the first time, I think your version looks good

"Str. foo".replace(/^([\s]*)(s|S)tr\.?\s/i,'$2trada ')
"Strada foo"
Member

missinglink commented Aug 3, 2016

Yes, you are correct in saying we would like Foo Str. to be Foo Strasse, I hadn't thought about that case, and in that case it makes sense to capitalize the street token.

I was referring to this situation, which should return foostrasse (when it's a compound word)

"fooStr".replace(/([^\s]+)(s|S)tr\.?$/i,'$1$2trasse')
"fooStrasse"

It's best to keep the words compound or separate depending on the source, so:

foo str. -> foo strasse
foostr. -> foostrasse

I think I didn't read the Moldava regex properly the first time, I think your version looks good

"Str. foo".replace(/^([\s]*)(s|S)tr\.?\s/i,'$2trada ')
"Strada foo"
@avulfson17

This comment has been minimized.

Show comment
Hide comment
@avulfson17

avulfson17 Aug 3, 2016

Contributor

So would the best way to fix that just be to change
"fooStr".replace(/([^\s]+)(s|S)tr\.?$/i,'$1$2trasse') to
"fooStr".replace(/([^\s]+)(s|\sS)tr\.?$/i,'$1$2trasse') ?

Contributor

avulfson17 commented Aug 3, 2016

So would the best way to fix that just be to change
"fooStr".replace(/([^\s]+)(s|S)tr\.?$/i,'$1$2trasse') to
"fooStr".replace(/([^\s]+)(s|\sS)tr\.?$/i,'$1$2trasse') ?

lib/streams/germanAbbreviationStream.js
@@ -1,20 +1,45 @@
-var _ = require('lodash');
var through = require('through2');

This comment has been minimized.

@orangejulius

orangejulius Aug 10, 2016

Member

Just to keep things clear it would be great to rename this file to germanicAbbreviationStream. A small change but an important one for clarity.

@orangejulius

orangejulius Aug 10, 2016

Member

Just to keep things clear it would be great to rename this file to germanicAbbreviationStream. A small change but an important one for clarity.

+ });
+});
+
+ tape( 'germanStream leaves str in the middle alone', function(test) {

This comment has been minimized.

@orangejulius

orangejulius Aug 10, 2016

Member

same thing german -> germanic

@orangejulius

orangejulius Aug 10, 2016

Member

same thing german -> germanic

+ return record.address_parts.street;
+}
+
+function createGermanAbbStream(){

This comment has been minimized.

@orangejulius

orangejulius Aug 10, 2016

Member

more german -> germanic

@orangejulius

orangejulius Aug 10, 2016

Member

more german -> germanic

+ input_stream.pipe(testedStream).pipe(destination_stream);
+}
+
+tape( 'germanStream expands tokens ending in "-str." to "-strasse" (mostly DEU)', function(test) {

This comment has been minimized.

@orangejulius

orangejulius Aug 10, 2016

Member

german -> germanic

@orangejulius

orangejulius Aug 10, 2016

Member

german -> germanic

@orangejulius

This comment has been minimized.

Show comment
Hide comment
Member

orangejulius commented Aug 17, 2016

LGTM

+var through = require('through2');
+
+// match strings ending in one of: ['str.', 'Str.', 'str', 'Str']
+var REGEX_MATCH_STREET_ABBR_SUFFIX = /([^\s]+)(s|S)tr\.?$/i;

This comment has been minimized.

@trescube

trescube Aug 17, 2016

Contributor

Specifying (s|S) and the i flag is redundant

@trescube

trescube Aug 17, 2016

Contributor

Specifying (s|S) and the i flag is redundant

@trescube

This comment has been minimized.

Show comment
Hide comment
@trescube

trescube Aug 17, 2016

Contributor

otherwise, :shipit:

Contributor

trescube commented Aug 17, 2016

otherwise, :shipit:

@avulfson17

This comment has been minimized.

Show comment
Hide comment
@avulfson17

avulfson17 Aug 17, 2016

Contributor

well the (s|S) captures the case of the first letter, which i want, whereas the i flag doesnt

Contributor

avulfson17 commented Aug 17, 2016

well the (s|S) captures the case of the first letter, which i want, whereas the i flag doesnt

@trescube

This comment has been minimized.

Show comment
Hide comment
@trescube

trescube Aug 17, 2016

Contributor

Sure it does:

> 'abc'.replace(/(a)/i, '$1');
'abc'
> 'Abc'.replace(/(a)/i, '$1');
'Abc'
Contributor

trescube commented Aug 17, 2016

Sure it does:

> 'abc'.replace(/(a)/i, '$1');
'abc'
> 'Abc'.replace(/(a)/i, '$1');
'Abc'
@avulfson17

This comment has been minimized.

Show comment
Hide comment
@avulfson17

avulfson17 Aug 17, 2016

Contributor

wow im dumb, i didnt even think about capturing just the s. i will change it now.

Contributor

avulfson17 commented Aug 17, 2016

wow im dumb, i didnt even think about capturing just the s. i will change it now.

@trescube

This comment has been minimized.

Show comment
Hide comment
@trescube

trescube Aug 17, 2016

Contributor

:shipit:!

Contributor

trescube commented Aug 17, 2016

:shipit:!

@orangejulius orangejulius merged commit 9d4455a into master Aug 19, 2016

3 checks passed

approvals/lgtm this commit looks good
continuous-integration/travis-ci/pr The Travis CI build passed
Details
continuous-integration/travis-ci/push The Travis CI build passed
Details

@orangejulius orangejulius deleted the expand_german_street_abbreviations branch Aug 19, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment