Remove variant names from index #529

orangejulius · 2021-09-21T16:05:10Z

Who's on First variant names are a useful collection of unofficial names for places, but they tend to be pretty messy. This PR explores the effect of removing them from indexing.

While there might be occasionally useful names in there, it seems like the majority are exact or near duplicates of more official names, or names that are so colloquial that they are not particularly useful (do we really need to support returning NYC for queries for "the big apple"?).

Here are some variant names for some key places, just to record the kind of data that's in there:

NYC:

Bigapple
NY City
NY Cty
New York City
New York Cty
Newyork
Newyorkcity
Novaiorque
Nycity
Thebigapple
Big Apple

San Francisco:

S Francisco
S. Francisco
SFO
Sanfran
Sanfrancisco
Frisco

China:

China - Peoples Republic
China Peoples Rep
China, People's Republic
Chinese
PR China
PR of China
People's Republic of China
Peoples Republic of China
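
To make the change concrete, here is a minimal sketch of dropping variant names from a Who's on First record before indexing. It assumes WOF's `name:<lang>_x_<kind>` property convention (for example `name:eng_x_variant`). The function name `names_to_index` and the sample record are hypothetical, and this is not the importer's actual code.

```python
def names_to_index(properties, excluded_kinds=("variant",)):
    """Return the name properties worth indexing, skipping excluded kinds.

    Assumes WOF-style keys of the form ``name:<lang>_x_<kind>``.
    """
    kept = {}
    for key, values in properties.items():
        if not key.startswith("name:"):
            continue  # not a name property at all
        # "name:eng_x_variant" -> kind is "variant"
        _, _, kind = key.rpartition("_x_")
        if kind in excluded_kinds:
            continue  # drop variant names entirely
        kept[key] = values
    return kept


# Hypothetical record, using a few of the NYC variant names listed above
nyc = {
    "wof:name": "New York",
    "name:eng_x_preferred": ["New York"],
    "name:eng_x_variant": ["Bigapple", "NY City", "Newyork", "Thebigapple"],
}

indexed = names_to_index(nyc)
print(sorted(indexed))
```

Only the preferred name survives in this sketch; other kinds could be excluded the same way by passing a different `excluded_kinds`.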

Joxit

LGTM

Paris

Pantruche
Ville-Lumière
Paname
Lutèce

I'm quite amazed, Google does return Paris when you search Ville-Lumière or Paname, maybe an easter egg 😅

missinglink · 2021-09-21T17:04:34Z

I'm 👍 for this, just interested to see if it causes any acceptance test failures before merging.

missinglink · 2021-09-21T17:06:42Z

If this is successful we should open corresponding issues on placeholder/parser/spatial to ensure they follow suit.

I don't remember the history of this but it's likely the convention was copied across to other parts of the codebase.

missinglink · 2021-09-21T17:11:58Z

I thought things like Ville-Lumière, Frisco, Big Apple were supposed to be filed under 'colloquial' 🤷‍♂️

It seems 'variant' is where toponyms go to die 😆

orangejulius · 2021-09-21T18:23:07Z

> If this is successful we should open corresponding issues on placeholder/parser/spatial to ensure they follow suit.

Agreed, I was just thinking about that. The impact of all those extra names is probably even higher for Placeholder since it considers matches across the entire parent hierarchy.

orangejulius · 2021-09-28T22:21:17Z

Okay, the results from this are in and they look pretty good. There is a decent increase to the overall score of our autocomplete acceptance tests, and a big increase in some other test cases like top_us_cities and us_states. There's almost no difference to test suites that look at addresses, which is expected for a change to the WOF importer.

As far as I can see there are almost no significant regressions from this change. A few individual autocomplete characters here and there, but nothing that looks like a trend.

If I had to summarize, it looks like removing variant names has three main positive effects:

1. Removing useless variant names allows desired results to score higher (the extra variant names mean Elasticsearch considers the name field to be longer, and thus gives a lower score).
2. Desired records that were previously erroneously deduplicated due to variant names that happened to match will now be displayed (the variant name could be either on the undesired record that remained or on the desired record that was removed).
3. Records that aren't desired are often removed from results, because the variant name that was causing them to be shown is no longer included.
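
The field-length effect can be illustrated with a small sketch. It assumes BM25 scoring with the commonly cited defaults `k1=1.2` and `b=0.75`; the similarity settings Pelias actually uses may differ, and the token counts below are made up for illustration.

```python
def bm25_tf_component(tf, field_len, avg_field_len, k1=1.2, b=0.75):
    """Term-frequency part of BM25, including field-length normalization."""
    # Longer-than-average fields get a larger norm, which shrinks the score
    norm = 1 - b + b * (field_len / avg_field_len)
    return tf * (k1 + 1) / (tf + k1 * norm)


# Hypothetical token counts for the name field of the same record
without_variants = bm25_tf_component(tf=1, field_len=2, avg_field_len=4)
with_variants = bm25_tf_component(tf=1, field_len=20, avg_field_len=4)

print(without_variants > with_variants)
```

In this simplified view, the same single match contributes less when the name field is padded with extra variant names.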

Here are some cases that show off one or more of these.

New York, New York

This is a query that has often been tough to get right since there are several results we want near the top, and lots of chances for duplicates or undesirable records to sneak in.

The autocomplete results don't really tell the whole story, so here are the results from the query before/after:

The test says that both New York city and county should appear in the results. I'd argue we should add New York State to that list. But in any case, the removal of variant names means that the desired results for the WOF city and county records score higher than before. This boosts them above the East New York locality and the New York City result from Geonames (hopefully we can remove that one completely via deduplication someday).

I think this also shows that when we really fix our scoring in pelias/pelias#862, we'll see even more and better results like this.

Missouri

A common trend in city and state results is fixing issues where a record simply wasn't ever displayed because it would be deduplicated. For example, the state of Missouri would essentially never come up in results because it would be deduped with Missouri Township, MO, which has Missouri in its list of variant names.

Our deduplication code currently prefers more granular results (for example, locality over county or region) in these cases. We might want to make that a little bit more strict with something like pelias/api#1557. A region and a locality with wildly different populations should probably not be considered duplicates if we can avoid that causing issues with places like Berlin.
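
As a rough illustration of the Missouri case, here is a simplified sketch of name-based deduplication that prefers the more granular layer. It is not the Pelias API's actual logic: the `LAYER_RANK` ordering, the `dedupe` helper, the layer assigned to Missouri Township, and the sample records are all assumptions made for the example.

```python
# Lower number = more granular. Hypothetical ranking for illustration only.
LAYER_RANK = {"locality": 0, "localadmin": 1, "county": 2, "region": 3}


def dedupe(records):
    """Keep one record per group of records sharing any name,
    preferring the most granular layer."""
    kept = []
    for rec in sorted(records, key=lambda r: LAYER_RANK[r["layer"]]):
        names = {n.lower() for n in rec["names"]}
        if any(names & {n.lower() for n in k["names"]} for k in kept):
            continue  # shares a name with a more granular record already kept
        kept.append(rec)
    return kept


state = {"label": "Missouri", "layer": "region", "names": ["Missouri"]}
township_with_variant = {
    "label": "Missouri Township, MO",
    "layer": "localadmin",  # assumed layer
    "names": ["Missouri Township", "Missouri"],  # "Missouri" is the variant name
}
township_without_variant = {
    "label": "Missouri Township, MO",
    "layer": "localadmin",  # assumed layer
    "names": ["Missouri Township"],
}

before = [r["label"] for r in dedupe([state, township_with_variant])]
after = [r["label"] for r in dedupe([state, township_without_variant])]
print(before)
print(after)
```

With the variant name present, the state is dropped as a duplicate of the township; without it, both records survive. The sketch sorts by granularity for simplicity, so it does not preserve result ranking.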

There are still some deduplication-related issues here that we should look at. Many of them can be fixed with data updates.

Various cities now showing up earlier in autocomplete

Individually these are all not necessarily amazing changes, but I noticed a decent trend of cities showing up one or two characters earlier in results. When we're talking queries that are only 2-4 characters, that's actually a big deal!

Final notes

I was expecting a bit of a decrease in index size for this change, since there are a reasonable number of variant names out there. But it turned out to only be about 5MB. I suppose there might be a slight performance increase because fewer documents will match any given query, but I'm not expecting it to be something we can notice. We should just be able to go by the various improvements and feel confident merging this :)

orangejulius requested review from missinglink and Joxit September 21, 2021 16:05

Joxit approved these changes Sep 21, 2021

missinglink approved these changes Sep 21, 2021

orangejulius force-pushed the remove-variant-names branch from 32d0ede to 690919e Compare September 23, 2021 20:19

orangejulius merged commit ff4bee8 into master Sep 29, 2021

orangejulius deleted the remove-variant-names branch September 29, 2021 12:44

orangejulius added a commit to pelias/acceptance-tests that referenced this pull request Nov 2, 2021

NYC autocomplete focus point tests pass

4cc6242

These pass as a result of removing variant names from WOF in pelias/whosonfirst#529

orangejulius mentioned this pull request Nov 2, 2021

November test updates pelias/acceptance-tests#554

Merged

This was referenced Nov 4, 2021

Issue with the spelling of a French town - La Queue-lez-Yvelines whosonfirst-data/whosonfirst-data-admin-fr#59

Closed

Issue with the spelling of a French town - La Queue-lez-Yvelines pelias/pelias#918

Open
