## Record Linkage

Record linkage is a technique for merging datasets whose records cannot be joined on exact keys because values contain typos or alternative spellings. In this chapter, you'll learn how to link records by calculating the similarity between strings. You'll then use your new skills to join two restaurant review datasets into one clean master dataset.

### Small distance, small difference
In the video exercise, you learned that there are multiple ways to calculate how similar or different two strings are. Now you'll practice using the stringdist package to compute string distances using various methods. It's important to be familiar with several methods, because no single method works best on every dataset.

In [None]:
# not run / packages not available
# install.packages("stringdist")
library(stringdist)

# Calculate Damerau-Levenshtein distance
stringdist("las angelos", "los angeles", method = "dl")

# Calculate Longest Common Substring (LCS) distance
stringdist("las angelos", "los angeles", method = "lcs")
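
To see why two methods can give different numbers for the same pair of strings, here is a minimal base R sketch that does not need the stringdist package. It uses `adist()` for plain Levenshtein distance (which, unlike Damerau-Levenshtein, does not count transpositions as a single edit) and a hand-written helper, `lcs_distance()`, which is a hypothetical name introduced here. The helper assumes the LCS distance is the number of characters left unpaired after the longest common subsequence is removed from both strings, which is how the stringdist documentation describes its "lcs" method.

```r
# Levenshtein distance with base R's adist() (utils package)
lev <- adist("las angelos", "los angeles")[1, 1]

# Hypothetical helper: LCS distance as the number of unpaired characters
lcs_distance <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  # len[i + 1, j + 1] is the LCS length of the first i chars of x
  # and the first j chars of y
  len <- matrix(0, nrow = length(x) + 1, ncol = length(y) + 1)
  for (i in seq_along(x)) {
    for (j in seq_along(y)) {
      len[i + 1, j + 1] <- if (x[i] == y[j]) {
        len[i, j] + 1
      } else {
        max(len[i, j + 1], len[i + 1, j])
      }
    }
  }
  length(x) + length(y) - 2 * len[length(x) + 1, length(y) + 1]
}

lcs <- lcs_distance("las angelos", "los angeles")
print(c(levenshtein = lev, lcs = lcs))
```

For this pair the sketch gives 2 for Levenshtein and 4 for LCS: each of the two mismatched letters is one substitution under Levenshtein, but LCS has no substitution, so each mismatch costs one deletion plus one insertion.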

### Fixing typos with string distance
In this chapter, one of the datasets you'll be working with, zagat, is a set of restaurants in New York, Los Angeles, Atlanta, San Francisco, and Las Vegas. The data is from Zagat, a company that collects restaurant reviews, and includes restaurant names, addresses, and phone numbers, as well as other restaurant information.

The city column contains the name of the city that the restaurant is located in. However, there are a number of typos throughout the column. Your task is to map each city to one of the five correctly-spelled cities contained in the cities data frame.

In [12]:
# install.packages("fuzzyjoin")
library(dplyr)
library(fuzzyjoin)

# data
zagat <- readRDS("zagat.rds")

city_actual <- c("new york", "los angeles", "atlanta", "san francisco", "las vegas")
cities <- data.frame(city_actual)

# Count the number of each city variation
zagat %>%
  count(city)

city,n
atlanta,64
los angeles,72
new york,98
las vegas,26
san francisco,50


In [None]:
# not run
# Join zagat and cities and look at results
zagat %>%
  # Left join based on stringdist using city and city_actual cols
  stringdist_left_join(cities, by = c("city" = "city_actual")) %>% # to join by strings
  # Select the name, city, and city_actual cols
  select(name, city, city_actual)
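
The idea behind a string-distance left join can be sketched in base R: compute the distance from every raw value to every correct value, keep the closest one, and leave the match empty when nothing is close enough. This is only an illustration of the idea, not how fuzzyjoin is implemented. The misspellings in `city_raw` are invented for the example (they are not taken from zagat), and `max_dist` here is a plain variable that plays the role of a distance threshold.

```r
city_actual <- c("new york", "los angeles", "atlanta", "san francisco", "las vegas")

# Hypothetical raw values, invented for illustration
city_raw <- c("new yrk", "los angles", "atlenta", "san fransisco", "las vegas", "paris")

# Levenshtein distance from each raw value (rows) to each correct city (columns)
d <- adist(city_raw, city_actual)

# Index and distance of the closest correct city for each raw value
nearest <- apply(d, 1, which.min)
best <- d[cbind(seq_along(city_raw), nearest)]

# Keep a match only when it is within the threshold, otherwise NA
max_dist <- 2
matched <- ifelse(best <= max_dist, city_actual[nearest], NA_character_)

result <- data.frame(city = city_raw, city_actual = matched, dist = best)
print(result)
```

Unlike a real fuzzy join, this sketch keeps only the single nearest candidate per row, so a raw value can never match two cities at once.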

### Pair blocking
Zagat and Fodor's are both companies that gather restaurant reviews. The zagat and fodors datasets both contain information about various restaurants, including addresses, phone numbers, and cuisine types. Some restaurants appear in both datasets, but don't necessarily have the same exact name or phone number written down. In this chapter, you'll work towards figuring out which restaurants appear in both datasets.

The first step towards this goal is to generate pairs of records so that you can compare them. In this exercise, you'll first generate all possible pairs, and then use your newly-cleaned city column as a blocking variable.

In [14]:
# install.packages("reclin")
library(reclin)

# data
fodors <- readRDS("fodors.rds")

# Generate all possible pairs
pair_blocking(zagat, fodors)

Simple blocking
  No blocking used.
  First data set:  310 records
  Second data set: 533 records
  Total number of pairs: 165 230 pairs

ldat with 165 230 rows and 2 columns
         x   y
1        1   1
2        2   1
3        3   1
4        4   1
5        5   1
6        6   1
7        7   1
8        8   1
9        9   1
10      10   1
:        :   :
165221 301 533
165222 302 533
165223 303 533
165224 304 533
165225 305 533
165226 306 533
165227 307 533
165228 308 533
165229 309 533
165230 310 533

In [16]:
# Generate pairs with same city
pair_blocking(zagat, fodors, blocking_var = "city")

Simple blocking
  Blocking variable(s): city
  First data set:  310 records
  Second data set: 533 records
  Total number of pairs: 40 532 pairs

ldat with 40 532 rows and 2 columns
        x   y
1       1   1
2       1   2
3       1   3
4       1   4
5       1   5
6       1   6
7       1   7
8       1   8
9       1   9
10      1  10
:       :   :
40523 310 414
40524 310 415
40525 310 416
40526 310 417
40527 310 418
40528 310 419
40529 310 420
40530 310 421
40531 310 422
40532 310 423
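
The effect of blocking is easy to reproduce on toy data. The sketch below uses two small, made-up data frames as stand-ins for zagat and fodors; it builds every possible pair with `expand.grid()` and then only the pairs that share a city with `merge()`. It illustrates the counting, not the internals of `pair_blocking()`.

```r
# Toy stand-ins for zagat and fodors (hypothetical rows, for illustration only)
a <- data.frame(id = 1:3, city = c("atlanta", "atlanta", "new york"))
b <- data.frame(id = 1:4, city = c("atlanta", "new york", "new york", "las vegas"))

# Without blocking: every record in a is paired with every record in b
all_pairs <- expand.grid(x = a$id, y = b$id)

# With blocking: only records that agree exactly on city are paired
blocked_pairs <- merge(a, b, by = "city", suffixes = c(".x", ".y"))

cat("all pairs:", nrow(all_pairs), "\n")
cat("blocked pairs:", nrow(blocked_pairs), "\n")

# The same arithmetic explains the unblocked total reported above
cat("310 x 533 =", 310 * 533, "\n")
```

Blocking only helps if the blocking variable is clean: a record whose city is misspelled would never be paired with its true match, which is why the city column was fixed first.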

### Comparing pairs
Now that you've generated the pairs of restaurants, it's time to compare them. You can customize how the comparisons are performed using the by and default_comparator arguments: by names the columns to compare, and default_comparator sets the similarity function applied to them. There's no single right answer as to what each should be set to, so in this exercise, you'll try out a couple of options.

In [17]:
# Generate pairs
pair_blocking(zagat, fodors, blocking_var = "city") %>%
  # Compare pairs by name using lcs()
  compare_pairs(by = "name",
                default_comparator = lcs())

Compare
  By: name

Simple blocking
  Blocking variable(s): city
  First data set:  310 records
  Second data set: 533 records
  Total number of pairs: 40 532 pairs

ldat with 40 532 rows and 3 columns
        x   y      name
1       1   1 0.3157895
2       1   2 0.3225806
3       1   3 0.2307692
4       1   4 0.2608696
5       1   5 0.4545455
6       1   6 0.2142857
7       1   7 0.1052632
8       1   8 0.2222222
9       1   9 0.3000000
10      1  10 0.4516129
:       :   :         :
40523 310 414 0.3606557
40524 310 415 0.2631579
40525 310 416 0.2105263
40526 310 417 0.3750000
40527 310 418 0.2978723
40528 310 419 0.2727273
40529 310 420 0.3437500
40530 310 421 0.3414634
40531 310 422 0.4081633
40532 310 423 0.1714286

In [18]:
# Generate pairs
pair_blocking(zagat, fodors, blocking_var = "city") %>%
  # Compare pairs by name, phone, addr
  compare_pairs(by = c("name", "phone", "addr"),
                default_comparator = jaro_winkler())

Compare
  By: name, phone, addr

Simple blocking
  Blocking variable(s): city
  First data set:  310 records
  Second data set: 533 records
  Total number of pairs: 40 532 pairs

ldat with 40 532 rows and 5 columns
        x   y      name     phone      addr
1       1   1 0.4871062 0.6746032 0.5703661
2       1   2 0.5234025 0.5555556 0.6140351
3       1   3 0.4564103 0.7222222 0.5486355
4       1   4 0.5102564 0.6746032 0.6842105
5       1   5 0.5982906 0.5793651 0.5515351
6       1   6 0.3581197 0.6746032 0.4825911
7       1   7 0.0000000 0.6269841 0.5457762
8       1   8 0.4256410 0.6269841 0.4979621
9       1   9 0.5013736 0.7777778 0.6342105
10      1  10 0.6011396 0.6746032 0.4654971
:       :   :         :         :         :
40523 310 414 0.4972291 0.6666667 0.5158263
40524 310 415 0.5778143 0.6746032 0.5065359
40525 310 416 0.4426564 0.6666667 0.4294118
40526 310 417 0.5315404 0.7152778 0.7070387
40527 310 418 0.5271102 0.6111111 0.7135914
40528 310 419 0.5204981 0.6944444 0.5
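
To make the shape of the comparison output concrete, here is a small base R sketch that adds one similarity column to a table of pairs. It swaps in a simple normalized Levenshtein similarity (1 minus the edit distance divided by the longer string's length) rather than reclin's `lcs()` or `jaro_winkler()` comparators, so the numbers will not match the output above. The restaurant names and the helper `edit_similarity()` are made up for illustration.

```r
# Hypothetical restaurant names from two sources
names_x <- c("apple pan the", "asahi ramen")
names_y <- c("apple pan", "asahi ramen", "chan dara")

# One row per candidate pair, indexed like the x and y columns above
pairs <- expand.grid(x = seq_along(names_x), y = seq_along(names_y))

# Hypothetical helper: similarity in [0, 1], where 1 means identical strings.
# Assumes neither string in a pair is empty.
edit_similarity <- function(a, b) {
  d <- mapply(function(u, v) adist(u, v)[1, 1], a, b, USE.NAMES = FALSE)
  1 - d / pmax(nchar(a), nchar(b))
}

# Compare each pair on the name field
pairs$name <- edit_similarity(names_x[pairs$x], names_y[pairs$y])
print(pairs)
```

Comparing on several fields, as in the name, phone, and addr example, amounts to repeating the last step once per column.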

### Putting it together
During this chapter, you've cleaned up the city column of zagat using string similarity, and you've generated and compared pairs of restaurants from zagat and fodors. The end is near: all that's left is to score the pairs, select the best matches, and link the data together. Then you'll be able to begin your analysis.

In [19]:
# Create pairs
pair_blocking(zagat, fodors, blocking_var = "city") %>%
  # Compare pairs
  compare_pairs(by = "name", default_comparator = jaro_winkler()) %>%
  # Score pairs probabilistically
  score_problink()

`group_by_()` is deprecated as of dplyr 0.7.0.
Please use `group_by()` instead.
See vignette('programming') for more help

Compare
  By: name

Simple blocking
  Blocking variable(s): city
  First data set:  310 records
  Second data set: 533 records
  Total number of pairs: 40 532 pairs

ldat with 40 532 rows and 4 columns
        x   y      name       weight
1       1   1 0.4871062 -0.018054756
2       1   2 0.5234025  0.034349215
3       1   3 0.4564103 -0.058771317
4       1   4 0.5102564  0.014794851
5       1   5 0.5982906  0.160497213
6       1   6 0.3581197 -0.171215199
7       1   7 0.0000000 -0.440170787
8       1   8 0.4256410 -0.096683808
9       1   9 0.5013736  0.001958745
10      1  10 0.6011396  0.165868942
:       :   :         :            :
40523 310 414 0.4972291 -0.003930282
40524 310 415 0.5778143  0.123235782
40525 310 416 0.4426564 -0.076056611
40526 310 417 0.5315404  0.046802575
40527 310 418 0.5271102  0.039989118
40528 310 419 0.5204981  0.029970093
40529 310 420 0.5635103  0.098522838
40530 310 421 0.4891899 -0.015176894
40531 310 422 0.6204433  0.203563939
40532 310 423 0.42337

In [20]:
# Create pairs
pair_blocking(zagat, fodors, blocking_var = "city") %>%
  # Compare pairs
  compare_pairs(by = "name", default_comparator = jaro_winkler()) %>%
  # Score pairs
  score_problink() %>%
  # Select pairs
  select_n_to_m()

Compare
  By: name

Simple blocking
  Blocking variable(s): city
  First data set:  310 records
  Second data set: 533 records
  Total number of pairs: 40 532 pairs

ldat with 40 532 rows and 5 columns
        x   y      name       weight select
1       1   1 0.4871062 -0.018054756  FALSE
2       1   2 0.5234025  0.034349215  FALSE
3       1   3 0.4564103 -0.058771317  FALSE
4       1   4 0.5102564  0.014794851  FALSE
5       1   5 0.5982906  0.160497213  FALSE
6       1   6 0.3581197 -0.171215199  FALSE
7       1   7 0.0000000 -0.440170787  FALSE
8       1   8 0.4256410 -0.096683808  FALSE
9       1   9 0.5013736  0.001958745  FALSE
10      1  10 0.6011396  0.165868942  FALSE
:       :   :         :            :      :
40523 310 414 0.4972291 -0.003930282  FALSE
40524 310 415 0.5778143  0.123235782  FALSE
40525 310 416 0.4426564 -0.076056611  FALSE
40526 310 417 0.5315404  0.046802575  FALSE
40527 310 418 0.5271102  0.039989118  FALSE
40528 310 419 0.5204981  0.029970093  FALSE
40529 

In [21]:
# Create pairs
pair_blocking(zagat, fodors, blocking_var = "city") %>%
  # Compare pairs
  compare_pairs(by = "name", default_comparator = jaro_winkler()) %>%
  # Score pairs
  score_problink() %>%
  # Select pairs
  select_n_to_m() %>%
  # Link the two data frames together
  link()


id.x,name.x,addr.x,city.x,phone.x,type.x,class.x,id.y,name.y,addr.y,city.y,phone.y,type.y,class.y
0,apple pan the,10801 w. pico blvd.,los angeles,310-475-3585,american,534,124,california pizza kitchen,207 s. beverly dr.,los angeles,310-275-1101,californian,121
1,asahi ramen,2027 sawtelle blvd.,los angeles,310-479-2231,noodle shops,535,128,chan dara,310 n. larchmont blvd.,los angeles,213-467-1052,asian,125
2,baja fresh,3345 kimber dr.,los angeles,805-498-4049,mexican,536,121,ca ` brea,346 s. la brea ave.,los angeles,213-938-2863,italian,118
3,belvedere the,9882 little santa monica blvd.,los angeles,310-788-2306,pacific new wave,537,131,dive !,10250 santa monica blvd.,los angeles,310-788-,dive american,128
4,benita's frites,1433 third st. promenade,los angeles,310-458-2889,fast food,538,149,louise's trattoria,4500 los feliz blvd.,los angeles,213-667-0777,italian,146
5,bernard's,515 s. olive st.,los angeles,213-612-1580,continental,539,172,trader vic's,9876 wilshire blvd.,los angeles,310-276-6345,asian,169
6,bistro 45,45 s. mentor ave.,los angeles,818-795-2478,californian,540,118,bistro garden,176 n. canon dr.,los angeles,310-550-3900,californian,115
8,brighton coffee shop,9600 brighton way,los angeles,310-276-7732,coffee shops,542,139,gladstone's,4 fish 17300 pacific coast hwy . at sunset blvd.,los angeles,310-454-3474,american,136
9,bristol farms market cafe,1570 rosecrans ave. s.,los angeles,310-643-5229,californian,543,129,clearwater cafe,168 w. colorado blvd.,los angeles,818-356-0959,health food,126
11,cafe'50s,838 lincoln blvd.,los angeles,310-399-1955,american,545,157,paty's,10001 riverside dr.,los angeles,818-761-9126,american,154
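
The select and link steps can be illustrated on a handful of scored pairs. The sketch below uses a simple greedy rule as a stand-in: take pairs in order of decreasing weight and keep a pair only if neither of its records has been used yet. This is an assumption made for illustration and is not necessarily how `select_n_to_m()` chooses pairs, so the two can disagree on real data. The weights, the names, and the helper `select_greedy_pairs()` are all hypothetical.

```r
# Hypothetical scored pairs: x and y index records in the two data frames
pairs <- data.frame(
  x = c(1, 1, 2, 2, 3),
  y = c(1, 2, 1, 2, 2),
  weight = c(0.9, 0.2, 0.4, 0.8, 0.85)
)

# Hypothetical helper: greedy one-to-one selection by decreasing weight
select_greedy_pairs <- function(pairs, threshold = 0) {
  selected <- logical(nrow(pairs))
  used_x <- c()
  used_y <- c()
  for (i in order(pairs$weight, decreasing = TRUE)) {
    # Stop once the remaining weights are at or below the threshold
    if (pairs$weight[i] <= threshold) break
    if (!(pairs$x[i] %in% used_x) && !(pairs$y[i] %in% used_y)) {
      selected[i] <- TRUE
      used_x <- c(used_x, pairs$x[i])
      used_y <- c(used_y, pairs$y[i])
    }
  }
  pairs$select <- selected
  pairs
}

selected <- select_greedy_pairs(pairs)

# Link: pull the matched records from two made-up data frames side by side
a <- data.frame(name = c("cafe a", "cafe b", "cafe c"))
b <- data.frame(name = c("cafe a.", "cafe c."))
keep <- selected[selected$select, ]
linked <- data.frame(name.x = a$name[keep$x], name.y = b$name[keep$y])
print(linked)
```

Whatever selection rule is used, it is worth inspecting the linked rows by eye before analysis: a pair can be the best available match for a record and still not refer to the same restaurant.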
