The R package fuzzylink
implements a probabilistic record linkage
procedure proposed in Ornstein
(2024). This
method allows users to merge datasets with fuzzy matches on a key
identifying variable. Suppose, for example, you have the following two
datasets:
dfA
#> name age
#> 1 Joe Biden 81
#> 2 Donald Trump 77
#> 3 Barack Obama 62
#> 4 George W. Bush 77
#> 5 Bill Clinton 77
dfB
#> name hobby
#> 1 Joseph Robinette Biden Football
#> 2 Donald John Trump Golf
#> 3 Barack Hussein Obama Basketball
#> 4 George Walker Bush Reading
#> 5 William Jefferson Clinton Saxophone
#> 6 George Herbert Walker Bush Skydiving
#> 7 Biff Tannen Bullying
#> 8 Joe Riley Jogging
We would like a procedure that correctly identifies which records in
dfB
are likely matches for each record in dfA
. The fuzzylink()
function performs this record linkage with a single line of code.
library(fuzzylink)
df <- fuzzylink(dfA, dfB, by = 'name', record_type = 'person')
df
#> A B sim jw
#> 1 Joe Biden Joseph Robinette Biden 0.7666187 0.7208273
#> 2 Donald Trump Donald John Trump 0.8389039 0.9333333
#> 3 Barack Obama Barack Hussein Obama 0.8456774 0.9200000
#> 4 George W. Bush George Walker Bush 0.8446634 0.9301587
#> 5 Bill Clinton William Jefferson Clinton 0.8731945 0.5788889
#> match_probability validated age hobby
#> 1 1 Yes 81 Football
#> 2 1 Yes 77 Golf
#> 3 1 Yes 62 Basketball
#> 4 1 Yes 77 Reading
#> 5 1 Yes 77 Saxophone
The procedure works by using pretrained text embeddings from OpenAI to
construct a measure of similarity for each pair of names. These
similarity measures are then used as predictors in a statistical model
to estimate the probability that two name pairs represent the same
entity. In the guide below, I will walk step-by-step through what’s
happening under the hood when we call the fuzzylink()
function. See
Ornstein
(2024) for
technical details.
You can install the development version of fuzzylink
from
GitHub with:
# install.packages("devtools")
devtools::install_github("joeornstein/fuzzylink")
You will also need an account with OpenAI. You can sign up here, after which you will need to create an API key here. For best performance, I strongly recommend purchasing at least $5 in API credits, which will significantly increase your API rate limits.
Once your account is created, copy-paste your API key into the following line of R code.
library(fuzzylink)
openai_api_key('YOUR API KEY GOES HERE', install = TRUE)
Now you’re all set up!
Here is some code to reproduce the example above and make sure that everything is working on your computer.
library(tidyverse)
library(fuzzylink)
dfA <- tribble(~name, ~age,
'Joe Biden', 81,
'Donald Trump', 77,
'Barack Obama', 62,
'George W. Bush', 77,
'Bill Clinton', 77)
dfB <- tribble(~name, ~hobby,
'Joseph Robinette Biden', 'Football',
'Donald John Trump ', 'Golf',
'Barack Hussein Obama', 'Basketball',
'George Walker Bush', 'Reading',
'William Jefferson Clinton', 'Saxophone',
'George Herbert Walker Bush', 'Skydiving',
'Biff Tannen', 'Bullying',
'Joe Riley', 'Jogging')
df <- fuzzylink(dfA, dfB, by = 'name', record_type = 'person')
df
If the df
object links all the presidents to their correct name in
dfB
, everything is running smoothly! (Note that you may see a warning
from glm.fit
. This is normal. The stats
package gets suspicious
whenever the model fit is too perfect.)
-
The
by
argument specifies the name of the fuzzy matching variable that you want to use to link records. The dataframesdfA
anddfB
must both have a column with this name. -
The
record_type
argument should be a singular noun describing the type of entity theby
variable represents (e.g. “person”, “organization”, “interest group”, “city”). It is used as part of a language model prompt when training the statistical model (see Step 3 below). -
The
model
argument specifies which OpenAI language model to prompt. It defaults to ‘gpt-3.5-turbo-instruct’, but for more difficult problems you can try ‘gpt-4’ or ‘gpt-4-turbo-preview’. Note that this will typically increase accuracy at the expense of cost and runtime (see “A Note on Cost” below). -
Several parameters—including
p
,k
,embedding_dimensions
,max_validations
, andparallel
—are for advanced users who wish to customize the behavior of the algorithm. See the package documentation for more details. -
If there are any variables that must match exactly in order to link two records, you will want to include them in the
blocking.variables
argument. As a practical matter, I strongly recommend including blocking variables wherever possible, as they reduce the time and cost necessary to compute pairwise distance metrics. Suppose, for example, that our two illustrative datasets have a column calledstate
, and we want to instructfuzzylink()
to only link people who live within the same state.
dfA <- tribble(~name, ~state, ~age,
'Joe Biden', 'Delaware', 81,
'Donald Trump', 'New York', 77,
'Barack Obama', 'Illinois', 62,
'George W. Bush', 'Texas', 77,
'Bill Clinton', 'Arkansas', 77)
dfB <- tribble(~name, ~state, ~hobby,
'Joseph Robinette Biden', 'Delaware', 'Football',
'Donald John Trump ', 'Florida', 'Golf',
'Barack Hussein Obama', 'Illinois', 'Basketball',
'George Walker Bush', 'Texas', 'Reading',
'William Jefferson Clinton', 'Arkansas', 'Saxophone',
'George Herbert Walker Bush', 'Texas', 'Skydiving',
'Biff Tannen', 'California', 'Bullying',
'Joe Riley', 'South Carolina', 'Jogging')
df <- fuzzylink(dfA, dfB,
by = 'name',
blocking.variables = 'state',
record_type = 'person')
df
#> A B sim block jw
#> 1 Joe Biden Joseph Robinette Biden 0.7667135 1 0.7208273
#> 2 Barack Obama Barack Hussein Obama 0.8456774 3 0.9200000
#> 3 George W. Bush George Walker Bush 0.8446634 4 0.9301587
#> 4 Bill Clinton William Jefferson Clinton 0.8731945 5 0.5788889
#> 5 Donald Trump <NA> NA NA NA
#> match_probability state validated age hobby
#> 1 1 Delaware Yes 81 Football
#> 2 1 Illinois Yes 62 Basketball
#> 3 1 Texas Yes 77 Reading
#> 4 1 Arkansas Yes 77 Saxophone
#> 5 NA New York <NA> 77 <NA>
Note that because Donald Trump is listed under two different states—New
York in dfA
and Florida in dfB
–the fuzzylink()
function no longer
returns a match for this record; all blocking variables must match
exactly before the function will link two records together. You can
specify as many blocking variables as needed by inputting their column
names as a vector.
The function returns a few additional columns along with the merged
dataframe. The column match_probability
reports the model’s estimated
probability that the pair of records refer to the same entity. This
column should be used to aid in validation and can be used for computing
weighted averages if a record in dfA
is matched to multiple records in
dFB
. The columns sim
and jw
are string distance measures that the
model uses to predict whether two records are a match. And if you
included blocking.variables
in the function call, there will be a
column called block
with an ID variable denoting which block the
records belong to.
If you’d like to know more details about about how fuzzylink()
works,
you can read the accompanying research
paper. In
this section, we’ll take a look under the hood at the previous example,
walking through each of the steps that fuzzylink()
takes to join the
two dataframes.
First, the function encodes each unique string in dfA
and dfB
as a
256-dimensional vector called an embedding. You can learn more about
embeddings
here,
but the basic idea is to represent text using a vector of real-valued
numbers, such that two vectors close to one another in space have
similar meanings.
library(tidyverse)
strings_A <- unique(dfA$name)
strings_B <- unique(dfB$name)
all_strings <- unique( c(strings_A, strings_B) )
embeddings <- get_embeddings(all_strings)
dim(embeddings)
#> [1] 13 256
head(embeddings['Bill Clinton',])
#> [1] 0.08017267 0.07627309 -0.01617664 -0.07971001 -0.09848085 -0.04970309
Next, we compute the cosine similarity between each name pair. This is
our measure of how closely related two pieces of text are, where 0 is
completely unrelated and 1 is identical. If you include
blocking.variables
in the call to fuzzylink()
, the function will
only consider within-block name pairs (i.e. it will only compute
similarity scores for records with an exact match on each blocking
variable). I strongly recommend blocking wherever possible, as it
significantly reduces cost and speeds up computation.
sim <- get_similarity_matrix(embeddings, strings_A, strings_B)
sim
#> Joseph Robinette Biden Donald John Trump Barack Hussein Obama
#> Joe Biden 0.7666187 0.5532721 0.5309486
#> Donald Trump 0.4317744 0.8389039 0.4480156
#> Barack Obama 0.5172067 0.4756720 0.8456774
#> George W. Bush 0.4942308 0.4878543 0.5681931
#> Bill Clinton 0.4885142 0.5038318 0.5173374
#> George Walker Bush William Jefferson Clinton
#> Joe Biden 0.5093797 0.5426070
#> Donald Trump 0.4807618 0.4465016
#> Barack Obama 0.4853952 0.5131033
#> George W. Bush 0.8446634 0.6115912
#> Bill Clinton 0.6233374 0.8731945
#> George Herbert Walker Bush Biff Tannen Joe Riley
#> Joe Biden 0.4701206 0.3014880 0.3908584
#> Donald Trump 0.3945427 0.3438834 0.2332212
#> Barack Obama 0.4242546 0.2546198 0.3482104
#> George W. Bush 0.7335619 0.2458795 0.3608438
#> Bill Clinton 0.5951578 0.2212838 0.3196263
We would like to use those cosine similarity scores to predict whether
two names refer to the same entity. In order to do that, we need to
first create a labeled dataset to fit a statistical model. The
get_training_set()
function selects a sample of name pairs and labels
them using the following prompt to GPT-4 (brackets denote input
variables).
Decide if the following two names refer to the same {record_type}. Misspellings, alternative names, and acronyms may be acceptable matches. Think carefully. Respond "Yes" or "No".'
Name A: {A}
Name B: {B}
Response:
train <- get_training_set(list(sim), record_type = 'person')
train
#> # A tibble: 40 × 5
#> A B sim jw match
#> <fct> <fct> <dbl> <dbl> <chr>
#> 1 Joe Biden "Barack Hussein Obama" 0.531 0.535 No
#> 2 Barack Obama "Biff Tannen" 0.255 0.457 No
#> 3 Donald Trump "Biff Tannen" 0.344 0.399 No
#> 4 Bill Clinton "Joseph Robinette Biden" 0.489 0.480 No
#> 5 Donald Trump "William Jefferson Clinton" 0.447 0.372 No
#> 6 Donald Trump "Donald John Trump " 0.839 0.933 Yes
#> 7 George W. Bush "Donald John Trump " 0.488 0.445 No
#> 8 Joe Biden "Joe Riley" 0.391 0.867 No
#> 9 George W. Bush "Joe Riley" 0.361 0.410 No
#> 10 Bill Clinton "George Herbert Walker Bush" 0.595 0.371 No
#> # ℹ 30 more rows
Next, we fit a logistic regression model on the train
dataset, so that
we can map similarity scores onto a probability that two records match.
We use both the cosine similarity (sim
) and a measure of lexical
similarity (jw
) as predictors in this model.
model <- glm(as.numeric(match == 'Yes') ~ sim + jw,
data = train,
family = 'binomial')
Append these predictions to each name pair in dfA
and dfB
.
# create a dataframe with each name pair
df <- sim |>
reshape2::melt() |>
set_names(c('A', 'B', 'sim')) |>
# compute lexical similarity measures for each name pair
mutate(jw = stringdist::stringsim(A, B, method = 'jw', p = 0.1))
df$match_probability <- predict(model, df, type = 'response')
head(df)
#> A B sim jw match_probability
#> 1 Joe Biden Joseph Robinette Biden 0.7666187 0.7208273 1.000000e+00
#> 2 Donald Trump Joseph Robinette Biden 0.4317744 0.4217172 2.220446e-16
#> 3 Barack Obama Joseph Robinette Biden 0.5172067 0.4191919 2.220446e-16
#> 4 George W. Bush Joseph Robinette Biden 0.4942308 0.5200216 2.220446e-16
#> 5 Bill Clinton Joseph Robinette Biden 0.4885142 0.4797980 2.220446e-16
#> 6 Joe Biden Donald John Trump 0.5532721 0.4444444 2.220446e-16
We now have a dataset with estimated match probabilities for each pair
of records in dfA
and dfB
. We could stop there and just report the
match probabilities. But for larger datasets we can get better results
if we conduct a final validation step. For each name pair within a range
of estimated match probabilities (by default 0.1 to 0.95), we will use
the GPT-4 prompt above to check whether the name pair is a match. These
labeled pairs are then added to the training dataset, the logistic
regression model is refined, and we repeat this process until there are
no matches left to validate. At that point, every record in dfA
is
either linked to a record in dfB
or there are no candidate matches in
dfB
with an estimated probability higher than the threshold.
Note that, by default, the fuzzylink()
function will validate at most
100,000 name pairs during this step. This setting reduces both cost and
runtime (see “A Note On Cost” below), but users who wish to validate
more name pairs within larger datasets can increase the cap using the
max_validations
argument.
# find all unlabeled name pairs within a range of match probabilities
matches_to_validate <- df |>
left_join(train, by = c('A', 'B', 'sim')) |>
filter(match_probability > 0.1,
match_probability < 0.95,
is.na(match))
while(nrow(matches_to_validate) > 0){
# validate matches using LLM prompt
matches_to_validate$match <- check_match(matches_to_validate$A,
matches_to_validate$B)
# append new labeled pairs to the train set
train <- train |>
bind_rows(matches_to_validate |>
select(A,B,sim,match))
# refine the model
model <- glm(as.numeric(match == 'Yes') ~ sim + jw,
data = train,
family = 'binomial')
# re-estimate match probabilities
df$match_probability <- predict(model, df, type = 'response')
# find all unlabeled name pairs within a range of match probabilities
matches_to_validate <- df |>
left_join(train, by = c('A', 'B', 'sim')) |>
filter(match_probability > 0.1,
match_probability < 0.95,
is.na(match))
}
Finally, we take all name pairs whose match probability is higher than a user-specified threshold and merge them into a single dataset.
matches <- df |>
# join with match labels from the training set
left_join(train |> select(A, B, match),
by = c('A', 'B')) |>
# only keep pairs that have been validated or have a match probability > 0.1
filter((match_probability > 0.1 & is.na(match)) | match == 'Yes') |>
right_join(dfA, by = c('A' = 'name'),
relationship = 'many-to-many') |>
left_join(dfB, by = c('B' = 'name'),
relationship = 'many-to-many')
matches
#> A B sim jw
#> 1 Joe Biden Joseph Robinette Biden 0.7666187 0.7208273
#> 2 Donald Trump Donald John Trump 0.8389039 0.9333333
#> 3 Barack Obama Barack Hussein Obama 0.8456774 0.9200000
#> 4 George W. Bush George Walker Bush 0.8446634 0.9301587
#> 5 Bill Clinton William Jefferson Clinton 0.8731945 0.5788889
#> match_probability match state.x age state.y hobby
#> 1 1 Yes Delaware 81 Delaware Football
#> 2 1 Yes New York 77 Florida Golf
#> 3 1 Yes Illinois 62 Illinois Basketball
#> 4 1 Yes Texas 77 Texas Reading
#> 5 1 Yes Arkansas 77 Arkansas Saxophone
Because the fuzzylink()
function makes several calls to the OpenAI
API—which charges a per-token fee—there is
a monetary cost associated with each use. Based on the package defaults
and API pricing as of March 2024, here is a table of approximate costs
for merging datasets of various sizes.
dfA | dfB | Approximate Cost (Default Settings) |
---|---|---|
10 | 10 | $0.02 |
10 | 100 | $0.02 |
10 | 1,000 | $0.02 |
10 | 10,000 | $0.02 |
10 | 100,000 | $0.07 |
10 | 1,000,000 | $0.6 |
100 | 10 | $0.15 |
100 | 100 | $0.15 |
100 | 1,000 | $0.15 |
100 | 10,000 | $0.16 |
100 | 100,000 | $0.21 |
100 | 1,000,000 | $0.74 |
1,000 | 10 | $1.5 |
1,000 | 100 | $1.5 |
1,000 | 1,000 | $1.5 |
1,000 | 10,000 | $1.51 |
1,000 | 100,000 | $1.56 |
1,000 | 1,000,000 | $2.09 |
10,000 | 10 | $15.01 |
10,000 | 100 | $15.01 |
10,000 | 1,000 | $15.01 |
10,000 | 10,000 | $15.01 |
10,000 | 100,000 | $15.06 |
10,000 | 1,000,000 | $15.59 |
100,000 | 10 | $30.06 |
100,000 | 100 | $30.06 |
100,000 | 1,000 | $30.06 |
100,000 | 10,000 | $30.06 |
100,000 | 100,000 | $30.12 |
100,000 | 1,000,000 | $30.64 |
1,000,000 | 10 | $30.59 |
1,000,000 | 100 | $30.59 |
1,000,000 | 1,000 | $30.59 |
1,000,000 | 10,000 | $30.59 |
1,000,000 | 100,000 | $30.64 |
1,000,000 | 1,000,000 | $31.17 |
Note that cost scales more quickly with the size of dfA
than with
dfB
, because it is more costly to complete LLM prompts for validation
than it is to retrieve embeddings. For particularly large datasets, one
can reduce costs by using GPT-3.5 (model = 'gpt-3.5-turbo'
), blocking
(blocking.variables
), or reducing the maximum number of validations
(max_validations
).