/
normalize-geography.Rmd
348 lines (289 loc) · 10.9 KB
/
normalize-geography.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
---
title: "Normalize Geographic Variables"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Normalize Geographic Variables}
%\VignetteEncoding{UTF-8}
%\VignetteEngine{knitr::rmarkdown}
editor_options:
chunk_output_type: console
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
library(knitr)
```
```{r campfin}
library(campfin)
packageVersion("campfin")
```
```{r setup, warning=FALSE, message=FALSE, error=FALSE}
library(dplyr)
library(readr)
library(stringr)
```
## Background
This vignette contains an example of the workflow used by the wranglers on
[The Accountability Project][tap], a tool created by
[The Investigative Reporting Workshop][irw] in Washington, DC. The
Accountability Project curates, cleans, and indexes public data to give
journalists, researchers, and the public a simple way to search across otherwise
siloed records. The data focuses on people, organizations and locations.
Since state-level campaign finance data is typically reported by the campaigns
and provided to them by the contributors themselves, there is often great
disparity in data quality. The campfin package was created to reduce this
disparity in a consistent, confident, and programmatic way.
[tap]: https://www.publicaccountability.org/
[irw]: https://investigativereportingworkshop.org/
## Messy Data
In this vignette, we will clean some exaggerated fake messy contribution data
from a fictitious campaign in Vermont. This data is found in the
`vt_contribs.csv` file included with our package.
```{r view_messy}
ex_file <- system.file("extdata", "vt_contribs.csv", package = "campfin")
```
```{r echo=FALSE}
kable(read_csv(ex_file, col_types = cols(.default = col_character())))
```
What are some of the potential problems we can see in this data?
* The `date` column is not parsed as an R date, making it impossible to perform
mathematical calculations like `min()`.
* There is one negative `amount` value and another that's zero. Not necessarily
uncommon, but worth noting.
* One record is missing both the contributor's `name`.
* One record is duplicated across every column.
* In `address` we see:
* Inconsistent capitalization.
* A mix of full and abbreviated suffixes.
* Alternative versions of `NA`.
* Unnecessary and inconsistent punctuation.
* Excess trailing and internal white space.
* Incorrect punctuation and replaced spaces.
* Repeating character strings used instead of `NA`.
* In `city` we see many of the same problems, plus:
* Geographic abbreviations.
* Repeated `state` information.
* Misspellings.
* Colloquial city abbreviations.
* In `state` we see a mix of full and abbreviated state names.
* In `zip`:
* Repeated digits used for `NA` values (NB: 55555 is a valid ZIP code,
99999 is not).
* Unnecessary and inconsistent [ZIP+4][zip4] usage.
* Leading zeroes [dropped by Excel][excel] or some other program.
[zip4]: https://en.wikipedia.org/wiki/ZIP_Code#ZIP+4 "zip4"
[excel]: https://support.microsoft.com/en-us/office/display-numbers-as-postal-codes-61b55c9f-6fe3-4e54-96ca-9e85c38a5a1d?ui=en-us&rs=en-us&ad=us "excel"
While this data is obviously much smaller and more full of errors than real
campaign finance data, these errors are not uncommon and need to be addressed.
The campfin package contains many of the tools we need to first find and then
fix these common problems.
## Read
In most cases, the first step is to download and read the file from a state
agency. When reading the data with the popular `readr::read_delim()` function,
the `col_date_mdy()` function can be used as a quick shortcut for
`readr::col_date(format = "%m/%d/%Y")`, the format most commonly found in U.S.
campaign finance data.
```{r read_messy}
vt <- read_csv(
file = ex_file,
trim_ws = FALSE,
na = c("", "NA", "N/A"),
col_types = cols(
amount = col_number(),
date = col_date_mdy()
)
)
```
We can see how the new `date` column is an actual date object, allowing for
mathematical manipulation.
```{r date_math}
min(vt$date)
```
Next, we should try to normalize our data as much as possible. We can use
some simple counting functions and built in vectors to check the cleanliness
of our raw data.
```{r prop_valid_before}
prop_in(vt$city, str_to_lower(valid_city))
prop_in(vt$state, valid_state)
prop_in(vt$zip, valid_zip)
```
```{r glimpse_fun}
col_stats(vt, n_distinct)
col_stats(vt, count_na)
```
A typical benchmark is to reach greater than 95% _valid_. That is, we want to
normalize our data enough that less than 5% of our data can not be easily
confirmed as valid using a fairly comprehensive list of cities, states, and
ZIP codes.
We will first try to reach this threshold by normalizing our data. This process
involves reducing inconsistencies through string manipulation. There are
separate `normal_*()` functions for each of the 4 types of geographic variables.
Typically we use `dplyr::mutate()` to create _new_, normalized versions of our
messy columns, preserving the old data for transparency. Here, we will just
overwrite our example data for simplicity.
```{r normal_address}
vt <- vt %>%
mutate(
address = normal_address(
address = address,
abbs = usps_street,
na = invalid_city,
na_rep = TRUE
),
city = normal_city(
city = city,
abbs = usps_city,
states = "VT",
na = invalid_city
),
state = normal_state(
state = state,
abbreviate = TRUE,
na_rep = TRUE,
valid = valid_state
),
zip = normal_zip(
zip = zip,
na_rep = TRUE
)
)
```
We can see how these functions and our built in data was used to normalize the
geographic contributor data and remove anything that didn't present real
information. This format is much more easily explored and search.
```{r showal, echo=FALSE}
vt %>% select(address, city, state, zip)
```
## Cities
However, the problem has not been solved. City names are the most troublesome;
There are so many city names and such great variety (compared to states and
ZIP codes), that it can be difficult to normalize and difficult to assess.
Our `valid_city` vector contains many city names, but far less than exist in the
country, especially when you account for neighborhoods that aren't _really_
cities, but shouldn't be changed (some of these are contained in our curated
`extra_city` vector).
```{r length_city}
length(valid_city)
sample(valid_city, 6)
sample(extra_city, 6)
# combine both vectors
many_city <- c(valid_city, extra_city)
```
Still, checking against this list is a good way to check for values that need
additional attention.
```{r bad_city, echo=FALSE}
(bad <- vt %>%
select(1, 7:9) %>%
filter(!is.na(city)) %>%
mutate(valid = city %in% many_city) %>%
filter(!valid))
```
It might not be clear what's actually wrong with these values. A common way to
check is by comparing them against _expected_ city for a given ZIP code.
```{r bad_join}
bad <- left_join(
x = bad,
y = zipcodes,
by = c("zip", "state"),
suffix = c("_raw", "_match")
)
```
```{r echo=FALSE}
kable(select(bad, -valid))
```
Now the problems become clear. Two city names are misspelled and the third is
an obvious abbreviation. When dealing with millions of city names, we need a
way to check each raw value against it's expected ZIP code match.
The `str_dist()` and `is_abbrev()` functions can be used to compared the value
we have with the value we expect. By only checking against the corresponding
city to that record's ZIP code, we are making extremely _confident_ changes
(compared to the incredibly useful clustering algorithms like those provided by
the `refinr` package).
First, we can use `str_dist()` to check the _distance_ between the two strings;
distance is defined as the number of changes we'd need to make to our normalized
value to get our expected matched value. If that distance is small (usually 1 or
2), we can pretty confidently use the matched value.
```{r str_dist}
str_dist("example", "xampel")
```
But the string distance does not catch colloquial city abbreviations (e.g., NYC,
BOS, LA, CHI, ABQ, BRNX, DFW, OKC). Many residents get so used to writing their
city's name they use abbreviations and assume them to be universally understood.
The `is_abbrev()` function can be used to check to one string might be an
abbreviation for another. Every abbreviation generated by the `abbreviate()`
function satisfied the requirements of `is_abbrev()`.
```{r is_abbrev}
is_abbrev(abb = "NYC", full = "New York City")
is_abbrev(abb = "DC", full = "Washington")
```
```{r check_city}
bad <- bad %>%
mutate(
match_dist = str_dist(city_raw, city_match),
match_abb = is_abbrev(city_raw, city_match)
)
```
```{r echo=FALSE}
kable(select(bad, -valid))
```
Here's what this process would look like when employed on an entire data frame.
It's important to ensure that the number of rows in our campaign finance data
is kept consistent throughout the wrangling process and that original columns
are left unchanged.
```{r swap_city}
vt <- vt %>%
rename(city_raw = city) %>%
# match city by ZIP
left_join(zipcodes) %>%
rename(city_match = city) %>%
mutate(
# check against match
match_dist = str_dist(city_raw, city_match),
match_abb = is_abbrev(city_raw, city_match),
city = ifelse(match_abb | match_dist == 1, city_match, city_raw)
) %>%
# remove intermediary columns
select(-city_raw, -city_match, -match_dist, -match_abb)
```
Now every `city`, `state`, and `zip` value is contained in our list of valid
values.
```{r show_swap, echo=FALSE}
vt %>%
select(1, 7:9) %>%
filter(!is.na(city)) %>%
mutate(
all_valid = all(
city %in% valid_city,
state %in% valid_state,
zip %in% valid_zip
)
) %>%
kable()
```
Once our data is as normal as we can confidently make it, we can begin to
explore. First, we'll explore the data for missing values with `flag_na()`,
which takes a [tidyselect] input of columns (or something like
`dplyr::everything()`).
```{r flag_na}
(vt <- flag_na(vt, name))
```
Next, we'll want to check for duplicate rows using `flag_dupes()`, which takes
the same kind of arguments. Here, we can ignore the supposedly unique `id`
variable. It's possible for a person to make the same contribution on the same
date, but we should flag them nonetheless.
```{r flag_dupes}
(vt <- flag_dupes(vt, -id, .both = TRUE))
```
[tidyselect]: https://github.com/r-lib/tidyselect "tidyselect"
## Conclude
This normalized data is now ready to be uploaded to the Accountability Project
and searched alongside [_1 billion_ other records][1bil]! These cleaned names
and addresses might bring up search results alongside one of our other sets of
public data: campaign expenditures, registered voters, nonprofit organizations,
stimulus spending, government contracts, lobbyist registrations, etc.
[1bil]: https://archive.investigativereportingworkshop.org/2021/02/17/1-billion-records-and-counting-the-accountability-project-reaches-new-milestone/
```{r echo=FALSE}
kable(vt)
```