-
Notifications
You must be signed in to change notification settings - Fork 0
/
v2-reclin.Rmd
196 lines (144 loc) · 5.63 KB
/
v2-reclin.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
---
title: "Blocking records for record linkage"
author: "Maciej Beręsewicz"
execute:
warning: false
message: false
lang: en
output:
html_vignette:
df_print: kable
toc: true
number_sections: true
fig_width: 6
fig_height: 4
vignette: >
%\VignetteIndexEntry{Blocking records for record linkage}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
# Setup
Read required packages
```{r setup}
library(blocking)
library(reclin2)
library(data.table)
library(rnndescent)
```
# Data
Read the example data from the tutorial on [the `reclin` package on the URos 2021 Conference](https://github.com/djvanderlaan/tutorial-reclin-uros2021). The data sets are from ESSnet on Data Integration as stated in the repository:
```
These totally fictional data sets are supposed to have captured details of
persons up to the date 31 December 2011. Any years of birth captured as 2012
are therefore in error. Note that in the fictional Census data set, dates of
birth between 27 March 2011 and 31 December 2011 are not necessarily in error.
Census: A fictional data set to represent some observations from a
decennial Census
CIS: Fictional observations from Customer Information System, which is
combined administrative data from the tax and benefit systems
In the dataset census all records contain a person_id. For some of the records
in cis the person_id is also available. This information can be used to
evaluate the linkage (assuming these records from the cis are representable
all records in the cis).
```
```{r}
census <- fread("https://raw.githubusercontent.com/djvanderlaan/tutorial-reclin-uros2021/main/data/census.csv")
cis <- fread("https://raw.githubusercontent.com/djvanderlaan/tutorial-reclin-uros2021/main/data/cis.csv")
```
+ `census` object has `r nrow(census)` rows and `r ncol(census)`,
+ `cis` object has `r nrow(census)` rows and `r ncol(census)`.
Census data
```{r}
head(census)
```
CIS data
```{r}
head(cis)
```
We need to create new columns that concatanates variables from `pername1` to `enumpc`. In the first step we replace `NA`s with `''`.
```{r}
census[, ":="(dob_day=as.character(dob_day), dob_mon=as.character(dob_mon), dob_year=as.character(dob_year))]
cis[, ":="(dob_day=as.character(dob_day), dob_mon=as.character(dob_mon),dob_year=as.character(dob_year))]
census[is.na(dob_day), dob_day := ""]
census[is.na(dob_mon), dob_mon := ""]
census[is.na(dob_year), dob_year := ""]
cis[is.na(dob_day), dob_day := ""]
cis[is.na(dob_mon), dob_mon := ""]
cis[is.na(dob_year), dob_year := ""]
census[, txt:=paste0(pername1, pername2, sex, dob_day, dob_mon, dob_year, enumcap, enumpc)]
cis[, txt:=paste0(pername1, pername2, sex, dob_day, dob_mon, dob_year, enumcap, enumpc)]
```
# Linking datasets
## Using basic functionalities of `blocking` package
The goal of this exercise is to link units from the CIS dataset to the CENSUS dataset.
```{r}
set.seed(2024)
result1 <- blocking(x = census$txt, y = cis$txt, verbose = 1, n_threads = 8)
```
Distribution of distances for each pair.
```{r}
hist(result1$result$dist, main = "Distribution of distances between pairs", xlab = "Distances")
```
Example pairs.
```{r}
head(result1$result, n= 10)
```
Let's take a look at the first pair. Obviously there is a typo in the `pername1`, but all the other variables are the same, so it appears to be a match.
```{r}
cbind(t(census[1, 1:9]), t(cis[8152, 1:9]))
```
## Assessing the quality
For some records, we have information about the correct linkage. We can use this information to evaluate our approach, but note that the information for evaluating quality is described in detail in the other vignette.
```{r}
matches <- merge(x = census[, .(x=1:.N, person_id)],
y = cis[, .(y = 1:.N, person_id)],
by = "person_id")
matches[, block:=1:.N]
head(matches)
```
So in our example we have `r nrow(matches)` pairs.
```{r}
set.seed(2024)
result2 <- blocking(x = census$txt, y = cis$txt, verbose = 1,
true_blocks = matches[, .(x, y, block)], n_threads = 8)
```
Let's see how our approach handled this problem.
```{r}
result2
```
It seems that the default parameters of the NND method result in an FNR of `r sprintf("%.2f",result2$metrics["fnr"]*100)`%. We can see if increasing the number of `k` (and thus `max_candidates`) as suggested in the [Nearest Neighbor Descent
](https://jlmelville.github.io/rnndescent/articles/nearest-neighbor-descent.html) vignette will help.
```{r}
set.seed(2024)
ann_control_pars <- controls_ann()
ann_control_pars$nnd$epsilon <- 0.2
result3 <- blocking(x = census$txt, y = cis$txt, verbose = 1,
true_blocks = matches[, .(x, y, block)], n_threads = 8,
control_ann = ann_control_pars)
```
Changing the `epsilon` search parameter from 0.1 to 0.2 decreased the FDR to `r sprintf("%.1f",result3$metrics["fnr"]*100)`%.
```{r}
result3
```
Finally, compare the NND and HNSW algorithm for this example.
```{r}
result4 <- blocking(x = census$txt, y = cis$txt, verbose = 1,
true_blocks = matches[, .(x, y, block)], n_threads = 8,
ann = "hnsw", seed = 2024)
```
It seems that the HNSW algorithm performed better with `r sprintf("%.2f",result4$metrics["fnr"]*100)`% FNR.
```{r}
result4
```
## Compare results
Finally, we can compare the results of two ANN algorithms. The overlap between neighbours is given by
```{r}
c("no tuning" = mean(result2$result[order(y)]$x == result4$result[order(y)]$x)*100,
"with tuning" = mean(result3$result[order(y)]$x == result4$result[order(y)]$x)*100)
```