This repository has been archived by the owner on Apr 28, 2018. It is now read-only.

Matchup yelp IDs with TA IDs not in Crosswalk #25

Closed
jhugman opened this issue Oct 24, 2016 · 8 comments

Comments

@jhugman
Contributor

jhugman commented Oct 24, 2016

Because Factual Crosswalk isn't reliable.

This would be a regular crawl, disconnected from a user hitting the prox-server API.

This would also include adding new Crosswalk records to the Factual database.

@jhugman jhugman changed the title Matchup yelp IDs with TripAdivsor IDs Matchup yelp IDs with TripAdivsor IDs not in Crosswalk Oct 24, 2016
@mcomella mcomella added this to the Sprint #6 – fix place data milestone Jan 24, 2017
@mcomella mcomella changed the title Matchup yelp IDs with TripAdivsor IDs not in Crosswalk Matchup yelp IDs with TA IDs not in Crosswalk, wikipedia, websites, etc. Jan 24, 2017
@mcomella mcomella self-assigned this Jan 24, 2017
@mcomella mcomella changed the title Matchup yelp IDs with TA IDs not in Crosswalk, wikipedia, websites, etc. Matchup yelp IDs with TA IDs not in Crosswalk Jan 24, 2017
@mcomella
Contributor

mcomella commented Jan 24, 2017

Steps:

  1. Are we running out of Factual calls during the crawl and missing data because of that? (See also #88, "Create tooling so we know when we hit API limits in crawl script".)
  2. Figure out which Yelp places are missing TA data (re-use #86, "No issue: Add scripts/places_missing_data script."? Or maybe this is dependent on ^)
  3. Create automation that matches Yelp -> TA (search by place name & gps in TA, do fuzzy name match to confirm). If it's correct often enough, use it, else...
  4. Do ^, but have a human check over that the results are correct (and allow them to add in corrections).

As James mentions, it'd be great to give back our findings to Factual.
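
The "fuzzy name match to confirm" part of step 3 could look roughly like the sketch below. It is only an illustration: the function names and the 0.6 threshold are assumptions, not anything from prox-server, and the threshold would need tuning against real Yelp/TA pairs.

```python
from difflib import SequenceMatcher


def name_similarity(yelp_name: str, ta_name: str) -> float:
    """Return a 0..1 similarity score, ignoring case and surrounding whitespace."""
    a = yelp_name.strip().lower()
    b = ta_name.strip().lower()
    return SequenceMatcher(None, a, b).ratio()


def is_probable_match(yelp_name: str, ta_name: str, threshold: float = 0.6) -> bool:
    """Confirm a TA candidate for a Yelp place by name.

    The threshold is a hypothetical starting point, not a measured value.
    """
    return name_similarity(yelp_name, ta_name) >= threshold
```

A lower threshold accepts more partial names but also more wrong places, which is where the human check in step 4 would come in.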

mcomella added a commit to mcomella/prox-server that referenced this issue Jan 26, 2017
@mcomella
Contributor

mcomella commented Jan 26, 2017

I discovered we can put a place's name and coordinates into TA's location_mapper API and it'll pass back IDs – no dealing with messy search!

In my first experiment, centered on the Ferry Building, 15/50 (30%) places were missing TA data:

  • 8 are roughly uncorrectable (16% of the total)
    • 6 are not in TA
    • 2 have incorrect data in TA leading to mismatches ("Barbarossa" is named "Bubble Lounge" & TA thinks "Cafe Algiers" is closed)
  • 2 may be caused by a bug in my code (yay unicode): the names have accents on Yelp but not on TA, which could cause name mismatches (correctable if we remove accents before searching)
  • 5 names are slightly different from Yelp to TA, causing a miss (e.g. "Mariposa Baking" on Yelp vs. "Mariposa" on TA).
    • To correct these, we could make a search by coordinates without place names (assuming the API works this way), take the results, and do our own name matching

The remaining 35 places were correct.

The full analysis is in a gist.
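
For the "search by coordinates, then do our own name matching" idea, one possible sketch is below. It assumes the coordinate-only search returns a list of `(ta_id, ta_name)` pairs, which is a hypothetical shape (whether the API supports this is still unverified), and it scores candidates by word overlap so that "Mariposa Baking" can still match "Mariposa".

```python
import re
from typing import Iterable, Optional, Tuple


def name_tokens(name: str) -> set:
    """Split a place name into lowercase alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))


def best_candidate(
    yelp_name: str,
    ta_candidates: Iterable[Tuple[str, str]],  # hypothetical (ta_id, ta_name) pairs
    min_overlap: float = 0.5,  # assumed cutoff, would need tuning
) -> Optional[str]:
    """Pick the TA candidate whose name shares the most words with the Yelp name."""
    target = name_tokens(yelp_name)
    best_id, best_score = None, 0.0
    for ta_id, ta_name in ta_candidates:
        tokens = name_tokens(ta_name)
        if not target or not tokens:
            continue
        # Overlap coefficient: shared words divided by the size of the smaller name.
        score = len(target & tokens) / min(len(target), len(tokens))
        if score > best_score:
            best_id, best_score = ta_id, score
    return best_id if best_score >= min_overlap else None
```

Because a one-word TA name only has to appear somewhere in the Yelp name, this is permissive and could introduce false matches between neighbouring places with a shared word.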

Next TODO:

  • Try to fix unicode bug
  • (?) Make this verification easily reproducible
  • Verify results in other locations are consistent (e.g. 30% missing)
  • If consistent, try to correct for name mismatches
  • (Make code easily runnable, finding missing places from cache, getting our own crosswalk for these places, and updating the cache with these places)

@mcomella
Contributor

mcomella commented Jan 27, 2017

My current implementation puts the info directly into TA's location_mapper, taking the first result, and, if there are no results, strips any accents and tries again. A possible improvement is to remove the name query and do our own place matching.
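
A rough sketch of that flow, for readers who don't want to dig through the commits. The `lookup` callable stands in for the real location_mapper request and is an assumption (its signature and return type are made up for illustration); only the accent stripping uses real standard-library calls.

```python
import unicodedata
from typing import Callable, List, Optional


def strip_accents(text: str) -> str:
    """Remove combining accent marks, e.g. 'Café' becomes 'Cafe'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def find_ta_id(
    name: str,
    lat: float,
    lng: float,
    lookup: Callable[[str, float, float], List[str]],  # hypothetical wrapper around the TA call
) -> Optional[str]:
    """Take the first result; if there are none, retry once with accents removed."""
    results = lookup(name, lat, lng)
    if not results:
        plain = strip_accents(name)
        if plain != name:  # skip the second request when nothing changed
            results = lookup(plain, lat, lng)
    return results[0] if results else None
```

Passing the lookup in as a parameter keeps the retry logic testable without network access.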

With the current implementation, using the 50 best match Yelp results with our top level categories from an 800m radius around the following locations, I got the following results:

  • Ferry Building SF
    • 72% (36/50) correct matches
    • 5 name mismatches
    • 6 places not on TA
    • 2 incorrect data in TA causing mismatch
    • 1 food truck (so matching location isn't really possible)
  • YVR office
    • 82% (41/50) correct matches (1 place had different address between the two services)
    • 8 name mismatches
    • 1 place not on TA
  • Cloud Gate in Chicago
    • 78% (39/50) correct matches
    • 9 name mismatches
    • 2 places not on TA

Notes:

  • "Best match" will likely be more prominent locations so there's more likely to be TA matches (I figured these are the locations we'd want to surface anyway).
  • Bolded name mismatches are potentially correctable (with the improvement mentioned above), but could introduce more error
  • My raw notes (with specific name mismatches & the sfo, yvr, and chi place lists) can be found in this gist.
  • I've been storing a list of name mismatches in docs/yelp_at_name_mismatches.yml

Updated TODO:

  • We should compare our success rates against factual crosswalk (to have a metric of improvement).
    • If we haven't improved much, consider the improvement mentioned above ^
  • ? Test on distance sort, rather than best match, for a more realistic test
  • Find places missing TA data from place cache, store their crosswalk
  • Figure out how to merge crosswalk ^ into main code base

mcomella added a commit to mcomella/prox-server that referenced this issue Jan 27, 2017
…add improvements notes.

See code comments for details and improvement notes.
mcomella added a commit to mcomella/prox-server that referenced this issue Jan 28, 2017
…add improvements notes.

See code comments for details and improvement notes.
mcomella added a commit to mcomella/prox-server that referenced this issue Jan 28, 2017
This will allow devs to check out yelp places in different areas to see how
well TA matches.
mcomella added a commit to mcomella/prox-server that referenced this issue Jan 28, 2017
This will allow us to figure out which places don't have TA data so we can run
crosswalk on it.
@mcomella
Contributor

mcomella commented Jan 30, 2017

Update for non-dense area and analysis of Factual crosswalk:

Data: 0.5km crawl around Nashville (36.162963, -86.780758) = 41 places. Adjusting for places where Yelp serves an area rather than a specific location (6), there are 35 places.

  • 74% (26/35) correct matches
    • 6 name mismatches
    • 1 not on TA
    • 2 food trucks
    • Unioning the results of Factual crosswalk, we get to 31/35 (89%)
      • Removing the places that are uncorrectable, there is only one place missing ("Crazy Town Nashville") for 97% (31/32)
      • Factual gets 5 unique places, we get 15 unique
      • Factual alone gets 16 total places

Raw notes added to the gist.


Overall, it seems we're getting about 75% correct from this method. In this one test, Factual Crosswalk alone was correct for 46% (16/35) of the places.
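
The Nashville percentages can be re-derived from the raw counts in the list above. The snippet below is just that arithmetic written out; the variable names are mine, and the "uncorrectable" count assumes it means the 1 place not on TA plus the 2 food trucks.

```python
total_places = 35     # 41 crawled, minus 6 where Yelp serves an area
ours_total = 26       # correct matches from the location_mapper approach
ours_unique = 15      # places only our approach matched
factual_total = 16    # places Factual Crosswalk matched
factual_unique = 5    # places only Factual matched
uncorrectable = 1 + 2 # assumption: 1 not on TA + 2 food trucks

# Places both approaches matched; the two ways of computing it must agree.
shared = ours_total - ours_unique
assert shared == factual_total - factual_unique

union = ours_unique + factual_unique + shared


def pct(numerator: int, denominator: int) -> int:
    """Percentage rounded to the nearest whole number."""
    return round(100 * numerator / denominator)


ours_pct = pct(ours_total, total_places)
union_pct = pct(union, total_places)
correctable_pct = pct(union, total_places - uncorrectable)
factual_pct = pct(factual_total, total_places)
```

The two approaches overlap on only 11 places, so the union is what gets coverage close to 90%.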

mcomella added a commit to mcomella/prox-server that referenced this issue Jan 31, 2017
…add improvements notes.

See code comments for details and improvement notes.
mcomella added a commit to mcomella/prox-server that referenced this issue Jan 31, 2017
This will allow devs to check out yelp places in different areas to see how
well TA matches.
mcomella added a commit to mcomella/prox-server that referenced this issue Jan 31, 2017
This will allow us to figure out which places don't have TA data so we can run
crosswalk on it.
liuche pushed a commit that referenced this issue Feb 1, 2017
…ts notes.

See code comments for details and improvement notes.
liuche pushed a commit that referenced this issue Feb 1, 2017
This will allow devs to check out yelp places in different areas to see how
well TA matches.
liuche pushed a commit that referenced this issue Feb 1, 2017
This will allow us to figure out which places don't have TA data so we can run
crosswalk on it.
liuche pushed a commit that referenced this issue Feb 1, 2017
@mcomella
Contributor

mcomella commented Feb 2, 2017

We can do yelp -> TA: we just need to integrate (#91).

@mcomella mcomella closed this as completed Feb 2, 2017