Kiernan Nicholls Mon Nov 8 11:58:11 2021
- Project
- Objectives
- Packages
- Data
- Download
- Unzip
- Columns
- Read
- Join
- Wrangle
- Join
- Explore
- Conclude
- Export
- Upload
The Accountability Project is an effort to cut across data silos and give journalists, policy professionals, activists, and the public at large a simple way to search across huge volumes of public data about people and organizations.
Our goal is to standardize public data on a few key fields by thinking of each dataset row as a transaction. For each transaction there should be (at least) 3 variables:
- All parties to a transaction.
- The date of the transaction.
- The amount of money involved.
This document describes the process used to complete the following objectives:
- How many records are in the database?
- Check for entirely duplicated records.
- Check ranges of continuous variables.
- Is there anything blank or missing?
- Check for consistency issues.
- Create a five-digit ZIP Code called `zip`.
- Create a `year` field from the transaction date.
- Make sure there is data on both parties to a transaction.
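The `zip` and `year` objectives amount to small transformations; here is a base-R sketch on toy data (the diary itself uses `campfin::normal_zip()` and `lubridate::year()` on the real tables).

```r
# Toy transactions (hypothetical data, for illustration only)
df <- data.frame(
  date = as.Date(c("2020-06-01", "2021-11-08")),
  zip4 = c("89102-1234", "89103")
)
df$zip  <- substr(df$zip4, 1, 5)             # five-digit ZIP code
df$year <- as.integer(format(df$date, "%Y")) # year from transaction date
```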
The following packages are needed to collect, manipulate, visualize, analyze, and communicate these results. The `pacman` package will facilitate their installation and attachment.
if (!require("pacman")) {
install.packages("pacman")
}
pacman::p_load(
tidyverse, # data manipulation
lubridate, # datetime strings
gluedown, # printing markdown
janitor, # clean data frames
campfin, # custom irw tools
aws.s3, # aws cloud storage
refinr, # cluster & merge
scales, # format strings
knitr, # knit documents
vroom, # fast reading
rvest, # scrape html
glue, # code strings
here, # project paths
httr, # http requests
fs # local storage
)
This document should be run as part of the `R_campfin` project, which lives as a sub-directory of the more general, language-agnostic `irworkshop/accountability_datacleaning` GitHub repository. The `R_campfin` project uses the RStudio projects feature and should be run as such. The project also uses the dynamic `here::here()` tool for file paths relative to your machine.
# where does this document knit?
here::i_am("nv/expends/docs/nv_expends_diary.Rmd")
The Nevada Secretary of State (NVSOS) office requires that one register for an account to access “bulk data download” service page.
Welcome to the Nevada Secretary of State online unified login system. Here you may access the following systems all with one login account: * Bulk Data Download * …
The process for downloading a report is outlined here:
Create a report for “Full Unabridged Database Dump” of “Campaign Finance” data.
This report will expose Contributions and Expenses report data filed within our “Aurora” Campaign Financial Disclosure system. This would not include data filed in a Financial Disclosure report. This bulk data report tool here should be used to pull the entire database or slightly smaller subsets of data such as all contributions filed after 1/1/2016 by groups of type “PAC”…
The site allows users to define the format for their data download. The site generated the following summary of our specified data format:
If “Text File - Fixed Width” is selected your report results will be inserted into a standard ASCII text file where each field starts at a specific “fixed” position for each line. For more specific information about the report format for custom built reports, including the position and data type of each field, click the “View Selected File Structure” button on the General tab when you view your report.
This file structure report is an HTML page with a description and six tables.
Your report will generate 6 fixed width ASCII text file(s) compressed into one Zip file named in the format “CampaignFinance.43993.&lt;Today’s Date&gt;.zip”. Below you will find the format of each file:
The data being reported off of is no more than 24 hours old. This data is copied very late each night from live data to minimize the large burden of bulk reporting on the production system.
The report data is split into multiple files, per the NVSOS FAQ page:
This is what is referred to as a normalized relational structure in the database world. Data items such as business entities and officers have a direct relation to one another. There can be any number of officers to one business entity. Because of this many to one relationship, the officers data is stored in a different data table (or file) than the business entities. Then we relate officer records to a business entity record by a common key data column, in this case the CorporationID… By separating officers and entities into separate records we can eliminate the redundancy and added size associated with putting the business entity data on each officer record or eliminate the complexity of allocating an undeterminable amount of officers on the one business entity record. This same many-to-one relationship is true of voter history records to voter records, UCC actions to UCC liens or Corporation Stocks to Corporations, to name a few.
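The many-to-one relationship described in the FAQ is resolved with an ordinary key join; a minimal sketch with made-up entity and officer tables (hypothetical data, not the NVSOS files) shows how the common key column relates the two.

```r
# Hypothetical normalized tables: many officers relate to one entity
entities <- data.frame(CorporationID = c(1, 2), entity = c("Acme Corp", "Zenith LLC"))
officers <- data.frame(CorporationID = c(1, 1, 2), officer = c("Ann", "Bob", "Cam"))

# Join each officer record to its business entity by the common key
joined <- merge(officers, entities, by = "CorporationID")
joined
```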
The summary continues with an individual structure summary for each of the six files included in the report, along with a key to the file names. These six tables describe both the data type and the width of each column. This information is needed to properly read the flat text files.
st_dir <- here("nv", "expends")
about_path <- path(st_dir, "File Format - Secretary of State, Nevada.html")
about <- read_html(x = about_path)
about_tables <- about %>%
html_nodes(".entryform") %>%
html_table(fill = TRUE) %>%
map(as_tibble)
about_tables <- about_tables[map_lgl(about_tables, ~ncol(.) == 4)]
about_tables <- map(about_tables, row_to_names, row_number = 1)
about_names <- str_subset(html_text(html_nodes(about, "b")), "\\d")
Field Name | Data Type | Start Position | Length |
---|---|---|---|
CandidateID (Primary Key) | int | 1 | 10 |
First Name | varchar(25) | 11 | 25 |
Last Name | varchar(25) | 36 | 25 |
Party | varchar(60) | 61 | 60 |
Office | varchar(60) | 121 | 60 |
Jurisdiction | varchar(50) | 181 | 50 |
Mailing Address | varchar(50) | 231 | 50 |
Mailing City | varchar(25) | 281 | 25 |
Mailing State | varchar(2) | 306 | 2 |
Mailing Zip | varchar(9) | 308 | 9 |
Field Name | Data Type | Start Position | Length |
---|---|---|---|
GroupID (Primary Key) | int | 1 | 10 |
Group Name | varchar(120) | 11 | 120 |
Group Type | varchar(100) | 131 | 100 |
Contact Name | varchar(35) | 231 | 35 |
Active | bit | 266 | 1 |
City | varchar(30) | 267 | 30 |
Field Name | Data Type | Start Position | Length |
---|---|---|---|
ReportID (Primary Key) | int | 1 | 10 |
CandidateID (Foreign Key Ref Candidates.CandidateID) | int | 11 | 10 |
GroupID (Foreign Key Ref Groups.GroupID) | int | 21 | 10 |
Report Name | varchar(120) | 31 | 120 |
Election Cycle | varchar(4) | 151 | 4 |
Filing Due Date | datetime | 155 | 10 |
Filed Date | datetime | 165 | 10 |
Amended | bit | 175 | 1 |
Superseded | bit | 176 | 1 |
Field Name | Data Type | Start Position | Length |
---|---|---|---|
ContactID (Primary Key) | int | 1 | 10 |
First Name | varchar(30) | 11 | 30 |
Middle Name | varchar(30) | 41 | 30 |
Last Name | varchar(100) | 71 | 100 |
Address 1 | varchar(60) | 171 | 60 |
Address 2 | varchar(30) | 231 | 30 |
City | varchar(45) | 261 | 45 |
State | varchar(2) | 306 | 2 |
Zip | varchar(10) | 308 | 10 |
Field Name | Data Type | Start Position | Length |
---|---|---|---|
ContributionID (Primary Key) | int | 1 | 10 |
ReportID (Foreign Key Ref Reports.ReportID) | int | 11 | 10 |
CandidateID (Foreign Key Ref Candidates.CandidateID) | int | 21 | 10 |
GroupID (Foreign Key Ref Groups.GroupID) | int | 31 | 10 |
Contribution Date | datetime | 41 | 10 |
Contribution Amount | money | 51 | 21 |
Contribution Type | varchar(30) | 72 | 30 |
ContributorID (Foreign Key Ref Contributors-Payees.ContactID) | int | 102 | 10 |
Field Name | Data Type | Start Position | Length |
---|---|---|---|
ExpenseID (Primary Key) | int | 1 | 10 |
ReportID (Foreign Key Ref Reports.ReportID) | int | 11 | 10 |
CandidateID (Foreign Key Ref Candidates.CandidateID) | int | 21 | 10 |
GroupID (Foreign Key Ref Groups.GroupID) | int | 31 | 10 |
Expense Date | datetime | 41 | 10 |
Expense Amount | money | 51 | 21 |
Expense Type | varchar(30) | 72 | 30 |
Payee ID (Foreign Key Ref Contributors-Payees.ContactID) | int | 102 | 10 |
The Data Types within this column correspond to the SQL Server 2012 data types in which the source data is stored. The purpose of exposing these data types is simply to provide a suggested guideline for any software programmers writing an interface to process these report files. Below is a partial list of SQL Server data types:
- `bigint` - Numeric, 8 bytes
- `int` - Numeric, 4 bytes
- `smallint` - Numeric, 2 bytes
- `tinyint` - Numeric, 1 byte
- `bit` - Results are True/False, which will be represented as “T”/“F”
- `money` - Monetary data, 8 bytes, accuracy to a ten-thousandth of a unit
- `float` - Floating precision number from -1.79E + 308 through 1.79E + 308
- `real` - Floating precision number from -3.40E + 38 through 3.40E + 38
- `datetime` - Date
- `char` - Fixed-length character data
- `varchar` - Variable-length data with a maximum of 8,000 characters
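These SQL types map onto R's atomic types on read: the `bit` values arrive as “T”/“F”, which coerce directly to logical, and `money` values are plain decimal text. A quick base-R illustration, independent of the report files themselves:

```r
# bit columns: "T"/"F" coerce to logical in base R (readr parses them the same way)
as.logical(c("T", "F"))
#> [1]  TRUE FALSE

# money columns: plain decimal text reads as a double
as.numeric("52970.78")
#> [1] 52970.78
```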
Definitions for a few variables can be found in the “Result Field” tab:

- `Jurisdiction`:
  > This will be name of the city or county for city/county offices currently held by the candidate (e.g. “CITY OF YERINGTON”, “DOUGLAS COUNTY”). This will be set to “NV SOS” for statewide offices such as Governor, State Controller or State assemblymen. An office assigned to a candidate could be updated by the NV SOS Elections staff as necessary when that candidate files for a new office.
- `Contribution Type`:
  > Use this column to differentiate which one of four contribution types this contribution record is: Monetary Contribution, In Kind Contribution, In Kind Written Commitment, or Written Commitment.
- `Last Name`:
  > When the contributor or payee is an organization as opposed to an individual, the entire organization name will be in the Last Name field only.
- `Expense Type`:
  > Use this column to differentiate which type of expense record this is: Monetary Expense or In Kind Expense.
- `Active`:
  > A value of F (False) indicates the group has been marked as inactive by the NV Secretary of State’s office Elections division due to submission of a “notice of inactivity” or for failure to renew annual registration.
- `Amended`:
  > A value of T (True) indicates this contributions and expense report has been marked as an amended report by the original filer implying this report supersedes a report for this same period, filed earlier. An amended report is to be full comprehensive for that report period and in essence replaces all contributions and expenses filed in the earlier report.
- `Election Cycle`:
  > The Election Cycle is the 4 digit filing or reporting year defining a filing period grouping together a collection of contribution and expenses reports…
- `Superseded`:
  > A report is Superseded when an amended report was filed later by the same filer for the same reporting period. In this case the Superseded field for the older report record will be set to T (True)…
Total number of records returned: 1,116,003
total_rows <- 1116003
A record is one single entity or row from a database table. The “Total number of records returned” displayed on the report preview page will be a summation of all rows returned from each table you are reporting from. For example, if your report queries for both Resident Agent and Corporation data from the Corporations database, the number of records returned might be 1000, 700 of which might be Corporation records and 300 being Resident Agent records.
Within seven days of running the report, the data can be downloaded from the link provided to the account email address. The link will not work for anybody not logged into that NVSOS account.
con_dir <- here("nv", "contribs", "data", "raw")
exp_dir <- here("nv", "expends", "data", "raw")
raw_dir <- dir_create(ifelse(dir_exists(con_dir), con_dir, exp_dir))
raw_url <- "https://www.nvsos.gov/yourreports/CampaignFinance.43993.102721094009.zip"
raw_zip <- path(raw_dir, basename(raw_url))
This URL contains the date the report was generated.
report_time <- mdy_hms(str_extract(raw_url, "\\d+(?=\\.zip$)"))
with_tz(report_time, tzone = "PST")
#> [1] "2021-10-27 09:40:09 PST"
aws_key <- path("IRW/raw_backup/nv", basename(raw_zip))
aws_bkt <- "publicaccountability"
if (!file_exists(raw_zip)) {
download.file(raw_url, raw_zip)
put_object(
file = raw_zip,
object = aws_key,
bucket = aws_bkt,
acl = "public-read",
show_progress = TRUE,
multipart = TRUE
)
} else if (object_exists(aws_key, aws_bkt)) {
save_object(
object = aws_key,
bucket = aws_bkt,
file = raw_zip
)
}
#> /home/kiernan/Documents/tap/R_tap/nv/contribs/data/raw/CampaignFinance.43993.102721094009.zip
This raw ZIP archive has been backed up to the IRW server.
as_fs_bytes(object_size(object = aws_key, bucket = aws_bkt))
#> 17.1M
The provided ZIP archive contains all six tables as fixed width text files.
raw_txt <- unzip(raw_zip, exdir = raw_dir)
Name | Length | Date |
---|---|---|
CampaignFinance.Cnddt.43993.102721094009.txt | 2.2M | 2021-10-27 09:40:00 |
CampaignFinance.Cntrbt.43993.102721094009.txt | 56.92M | 2021-10-27 09:40:00 |
CampaignFinance.Cntrbtrs-.43993.102721094009.txt | 67.96M | 2021-10-27 09:40:00 |
CampaignFinance.Expn.43993.102721094009.txt | 33.44M | 2021-10-27 09:40:00 |
CampaignFinance.Grp.43993.102721094009.txt | 380.07K | 2021-10-27 09:40:00 |
CampaignFinance.Rpr.43993.102721094009.txt | 7.73M | 2021-10-27 09:40:00 |
We need to match the order of this vector to the order of the tables.
names(raw_txt) <- c(
"Candidates",
"Contributions",
"Contributors",
"Expenses",
"Groups",
"Reports"
)
raw_txt <- raw_txt[match(names(about_tables), names(raw_txt))]
names(raw_txt) == names(about_tables)
#> [1] TRUE TRUE TRUE TRUE TRUE TRUE
We can use the tables read from the HTML file, described in the About section above, to create (1) the column width tables expected by `read_fwf()` and (2) the readr column type specification objects. Two functions will take the `Field Name`, `Data Type`, and `Length` columns and convert them into those two objects.
as_fwf_width <- function(.data) {
fwf_widths(
widths = as.integer(x = .data[[4]]),
col_names = str_remove(
string = .data[[1]],
pattern = "\\s\\(.*\\)"
)
)
}
as_col_spec <- function(.data) {
x <- .data[["Data Type"]]
x <- case_when(
str_detect(x, "varchar") ~ "c",
str_detect(x, "datetime") ~ "D",
str_detect(x, "money") ~ "d",
str_detect(x, "bit") ~ "l",
str_detect(x, "int") ~ "i",
TRUE ~ "c"
)
as.col_spec(
x = setNames(
object = x,
nm = str_remove(
string = .data[["Field Name"]],
pattern = "\\s\\(.*\\)"
)
)
)
}
about_tables$Groups
#> # A tibble: 6 × 4
#> `Field Name` `Data Type` `Start Position` Length
#> <chr> <chr> <chr> <chr>
#> 1 GroupID (Primary Key) int 1 10
#> 2 Group Name varchar(120) 11 120
#> 3 Group Type varchar(100) 131 100
#> 4 Contact Name varchar(35) 231 35
#> 5 Active bit 266 1
#> 6 City varchar(30) 267 30
as_fwf_width(about_tables$Groups)
#> # A tibble: 6 × 3
#> begin end col_names
#> <int> <int> <chr>
#> 1 0 10 GroupID
#> 2 10 130 Group Name
#> 3 130 230 Group Type
#> 4 230 265 Contact Name
#> 5 265 266 Active
#> 6 266 296 City
as_col_spec(about_tables$Groups)
#> cols(
#> GroupID = col_integer(),
#> `Group Name` = col_character(),
#> `Group Type` = col_character(),
#> `Contact Name` = col_character(),
#> Active = col_logical(),
#> City = col_character()
#> )
raw_widths <- map(about_tables, as_fwf_width)
raw_types <- map(about_tables, as_col_spec)
One file seems to have been exported with some empty rows and other rows without sufficient padding. This can be fixed with string manipulation.
tmp <- file_temp(ext = "txt")
raw_txt[["Contributors"]] %>%
read_lines(skip_empty_rows = TRUE) %>%
str_subset(pattern = "^\\d") %>%
str_pad(width = 170, side = "right", pad = " ") %>%
write_lines(file = tmp)
raw_txt[["Contributors"]] <- tmp
All six tables can then be read into a list using `readr::read_fwf()` with the (1) width tables and (2) column type specifications.
nv <- pmap( # 1,116,010
.f = read_fwf,
locale = locale(
date_format = "%m/%d/%Y",
tz = "US/Pacific"
),
.l = list(
file = raw_txt,
col_positions = raw_widths,
col_types = raw_types
)
)
The total number of rows read is nearly identical to the count we were given when exporting.
comma(sum(map_dbl(nv, nrow)))
#> [1] "1,116,040"
enframe(map_dbl(nv, nrow))
#> # A tibble: 6 × 2
#> name value
#> <chr> <dbl>
#> 1 Candidates 7263
#> 2 Groups 1306
#> 3 Reports 45561
#> 4 Contributors 223408
#> 5 Contributions 528200
#> 6 Expenses 310302
The primary table of interest here is `Expenses`, which lists the 310,302 expenditures made by committees and reported to the state. This table does not identify the spending committee or receiving entity; that information is found in the `Groups` and `Contributors` tables, respectively. We need to add variables identifying all parties to each expenditure.
nv$Expenses
#> # A tibble: 310,302 × 8
#> expense_id report_id candidate_id group_id expense_date expense_amount expense_type payee_id
#> <int> <int> <int> <int> <date> <dbl> <chr> <int>
#> 1 2 6980 NA 1220 2006-07-05 1688. Monetary Expense 4
#> 2 3 6980 NA 1220 2006-07-31 52971. Monetary Expense 5
#> 3 4 6980 NA 1220 2006-07-25 911. Monetary Expense 6
#> 4 5 6980 NA 1220 2006-07-05 1764. Monetary Expense 7
#> 5 6 6980 NA 1220 2006-06-23 108. Monetary Expense 7
#> 6 7 6983 NA 1332 2006-01-01 0 Monetary Expense 9
#> 7 8 6983 NA 1332 2006-01-01 0 In Kind Expense 9
#> 8 9 6985 1165 NA 2006-06-20 50.6 Monetary Expense 10
#> 9 10 6985 1165 NA 2006-07-03 57.0 Monetary Expense 11
#> 10 11 6985 1165 NA 2006-07-12 68.1 Monetary Expense 12
#> # … with 310,292 more rows
We will first join the committees. Expenditures can be made by either a candidate or a committee, each listed in its own table with its own key column in `Expenses`. We can combine these keys and tables.
rec_cands <- nv$Candidates %>%
select(candidate_id, first_name, last_name) %>%
filter(candidate_id %in% nv$Expenses$candidate_id) %>%
mutate(recipient_type = "Candidate", .after = last_name) %>%
unite(
col = recipient_name,
first_name, last_name,
sep = " ",
remove = TRUE,
na.rm = TRUE
)
rec_comms <- nv$Groups %>%
filter(group_id %in% nv$Expenses$group_id) %>%
select(group_id, recipient_name = group_name, recipient_type = group_type)
all_recipients <- rec_cands %>%
bind_rows(rec_comms) %>%
relocate(group_id, .after = candidate_id)
#> # A tibble: 14 × 4
#> candidate_id group_id recipient_name recipient_type
#> <int> <int> <chr> <chr>
#> 1 5005 NA Wesley Duncan Candidate
#> 2 13454 NA Jacob Deaville Candidate
#> 3 NA 6918 Committee to Elect Karl Armstrong Independent Expenditure
#> 4 NA 2293 Working Families for Assembly 13 Independent Expenditure
#> 5 NA 2420 Everytown for Gun Safety Action Fund, Inc. (Nonprofit) Non-Profit Corporation
#> 6 NA 1967 Nevada State Education Association Non-Profit Corporation
#> 7 NA 2339 Nevadans for a Court of Appeals PAC Ballot Advocacy Group
#> 8 NA 1626 School, Road and Safety Funding Solutions Ballot Advocacy Group PAC Ballot Advocacy Group
#> 9 NA 3661 CITIZENS TO PRESERVE NEIGHBORHOODS Political Action Committee
#> 10 NA 1267 ACEC Nevada PAC Political Action Committee
#> 11 NA 1225 Clark County Republican Party Political Party Committee
#> 12 NA 1373 Pahrump Valley Republican Women Political Party Committee
#> 13 NA 3680 Committee to Recall Senator Nicole Cannizzaro Recall Committee
#> 14 NA 6957 Expel Michele Recall Committee
all_recipients <- mutate(
.data = all_recipients,
.before = 1,
.keep = "unused",
recipient_id = coalesce(
candidate_id,
group_id
)
)
nve <- nv$Expenses %>%
mutate(
.after = group_id,
.keep = "unused",
recipient_id = coalesce(
candidate_id,
group_id
)
) %>%
left_join(
y = all_recipients,
by = "recipient_id"
) %>%
relocate(
.after = recipient_id,
recipient_name, recipient_type
)
all_payees <- nv$Contributors %>%
filter(contact_id %in% nv$Expenses$payee_id)
To improve the searchability of the database, we will perform some consistent, confident string normalization. For geographic variables like city names and ZIP codes, the corresponding `campfin::normal_*()` functions are tailor-made to facilitate this process.

For the street address variables, the `campfin::normal_address()` function will force consistent case, remove punctuation, and abbreviate official USPS suffixes.
addr_norm <- all_payees %>%
distinct(address_1, address_2) %>%
mutate(
across(
starts_with("address_"),
list(xnorm = normal_address),
abbs = usps_street
)
) %>%
unite(
col = address_norm,
ends_with("_xnorm"),
sep = " ",
remove = TRUE,
na.rm = TRUE
) %>%
mutate(across(address_norm, na_if, ""))
addr_norm
#> # A tibble: 44,203 × 3
#> address_1 address_2 address_norm
#> <chr> <chr> <chr>
#> 1 275 7th Ave. <NA> 275 7TH AVE
#> 2 155 Yorba St <NA> 155 YORBA ST
#> 3 1630 S Commerce St <NA> 1630 S COMMERCE ST
#> 4 4395 Polaris Ave <NA> 4395 POLARIS AVE
#> 5 1224 Western Ave <NA> 1224 WESTERN AVE
#> 6 <NA> <NA> <NA>
#> 7 4206 S. 108th St. <NA> 4206 S 108TH ST
#> 8 701 Great Basin Blvd. <NA> 701 GREAT BASIN BLVD
#> 9 7801 N. Lamar Blvd. Suite D-106 7801 N LAMAR BLVD SUITE D106
#> 10 1689 E. Great Basin Hiway <NA> 1689 E GREAT BASIN HWY
#> # … with 44,193 more rows
all_payees <- left_join(
x = all_payees,
y = addr_norm,
by = c("address_1", "address_2")
)
For ZIP codes, the `campfin::normal_zip()` function will attempt to create valid five-digit codes by removing the ZIP+4 suffix and restoring leading zeroes dropped by other programs like Microsoft Excel.
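The repair amounts to trimming the ZIP+4 suffix and left-padding to five digits, which `normal_zip()` handles internally; in base-R terms (an illustration, not the diary's code):

```r
# Trim a ZIP+4 suffix and restore a leading zero dropped by a spreadsheet
sub("-.*$", "", "89102-1234") # drop the "+4" suffix
#> [1] "89102"
sprintf("%05d", 501L)         # 00501 (Holtsville, NY) often arrives as 501
#> [1] "00501"
```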
all_payees <- all_payees %>%
mutate(
zip_norm = normal_zip(
zip = zip,
na_rep = TRUE
)
)
progress_table(
all_payees$zip,
all_payees$zip_norm,
compare = valid_zip
)
#> # A tibble: 2 × 6
#> stage prop_in n_distinct prop_na n_out n_diff
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 all_payees$zip 0.939 5914 0.00264 3751 2393
#> 2 all_payees$zip_norm 0.988 4056 0.00776 714 418
Valid two-digit state abbreviations can be made using the `campfin::normal_state()` function.
all_payees <- all_payees %>%
mutate(
state_norm = normal_state(
state = state,
abbreviate = TRUE,
na_rep = TRUE
)
)
all_payees %>%
filter(state != state_norm) %>%
count(state, state_norm, sort = TRUE)
#> # A tibble: 71 × 3
#> state state_norm n
#> <chr> <chr> <int>
#> 1 Nv NV 792
#> 2 nv NV 695
#> 3 Ca CA 45
#> 4 Ne NE 43
#> 5 Ut UT 23
#> 6 ca CA 20
#> 7 Tx TX 17
#> 8 Il IL 15
#> 9 ut UT 14
#> 10 Fl FL 13
#> # … with 61 more rows
progress_table(
all_payees$state,
all_payees$state_norm,
compare = valid_state
)
#> # A tibble: 2 × 6
#> stage prop_in n_distinct prop_na n_out n_diff
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 all_payees$state 0.964 200 0.00261 2215 144
#> 2 all_payees$state_norm 0.998 100 0.00756 120 44
Cities are the most difficult geographic variable to normalize, simply due to the wide variety of valid city names and formats.

The `campfin::normal_city()` function is a good start: it again converts case and removes punctuation, but also expands USPS abbreviations. We can also remove `invalid_city` values.
norm_city <- all_payees %>%
distinct(city, state_norm, zip_norm) %>%
mutate(
city_norm = normal_city(
city = city,
abbs = usps_city,
states = c("NV", "DC", "NEVADA"),
na = invalid_city,
na_rep = TRUE
)
)
We can further improve normalization by comparing our normalized value against the expected value for that record’s state abbreviation and ZIP code. If the normalized value is either an abbreviation for or very similar to the expected value, we can confidently swap those two.
norm_city <- norm_city %>%
rename(city_raw = city) %>%
left_join(
y = zipcodes,
by = c(
"state_norm" = "state",
"zip_norm" = "zip"
)
) %>%
rename(city_match = city) %>%
mutate(
match_abb = is_abbrev(city_norm, city_match),
match_dist = str_dist(city_norm, city_match),
city_swap = if_else(
condition = !is.na(match_dist) & (match_abb | match_dist == 1),
true = city_match,
false = city_norm
)
) %>%
select(
-city_match,
-match_dist,
-match_abb
)
all_payees <- left_join(
x = all_payees,
y = norm_city,
by = c(
"city" = "city_raw",
"state_norm",
"zip_norm"
)
)
The OpenRefine algorithms can be used to group similar strings and replace the less common versions with their most common counterpart. This can greatly reduce inconsistency, but with low confidence; we will only keep any refined strings that have a valid city/state/zip combination.
good_refine <- all_payees %>%
mutate(
city_refine = city_swap %>%
key_collision_merge() %>%
n_gram_merge(numgram = 1)
) %>%
filter(city_refine != city_swap) %>%
inner_join(
y = zipcodes,
by = c(
"city_refine" = "city",
"state_norm" = "state",
"zip_norm" = "zip"
)
)
#> # A tibble: 17 × 5
#> state_norm zip_norm city_swap city_refine n
#> <chr> <chr> <chr> <chr> <int>
#> 1 NV 89103 LAS VEGASAS LAS VEGAS 3
#> 2 NV 89410 GARDENVILLE GARDNERVILLE 3
#> 3 NV 89020 ARMAGOSA VALLEY AMARGOSA VALLEY 2
#> 4 NV 89448 ZYPHER COVE ZEPHYR COVE 2
#> 5 CA 94103 SAN FRANSICO SAN FRANCISCO 1
#> 6 CA 94103 SANFRANSCICO SAN FRANCISCO 1
#> 7 IA 52240 IOWA CITY IA IOWA CITY 1
#> 8 ID 83814 COUER DALENE COEUR D ALENE 1
#> 9 KY 40205 LOUSIEVILLE LOUISVILLE 1
#> 10 NV 89021 LONGDALE LOGANDALE 1
#> 11 NV 89043 PICHOE PIOCHE 1
#> 12 NV 89316 EEUREKAUREKA EUREKA 1
#> 13 NV 89410 GARDENERVILL GARDNERVILLE 1
#> 14 NV 89410 GARDNERVVILL GARDNERVILLE 1
#> 15 NY 12010 AMERSTERDAM AMSTERDAM 1
#> 16 OH 45241 CINCINATTI CINCINNATI 1
#> 17 SC 29406 NORTH CHARLESTON CHARLESTON 1
Then we can join the refined values back to the database.
all_payees <- all_payees %>%
left_join(good_refine, by = names(.)) %>%
mutate(city_refine = coalesce(city_refine, city_swap))
nve <- left_join(
x = nve,
y = all_payees,
by = c("payee_id" = "contact_id")
)
Our goal for normalization was to increase the proportion of city values known to be valid and reduce the total distinct values by correcting misspellings.
stage | prop_in | n_distinct | prop_na | n_out | n_diff |
---|---|---|---|---|---|
str_to_upper(nve$city) | 0.975 | 2219 | 0.001 | 8451 | 617 |
nve$city_norm | 0.983 | 2103 | 0.003 | 5695 | 492 |
nve$city_swap | 0.996 | 1870 | 0.003 | 1494 | 250 |
nve$city_refine | 0.996 | 1856 | 0.003 | 1430 | 238 |
You can see how the percentage of valid values increased with each stage.
More importantly, the number of distinct values decreased with each stage. We were able to confidently change many distinct invalid values to their valid equivalents.
Before exporting, we can remove the intermediary normalization columns and rename all added variables with the `_clean` suffix.
nve <- nve %>%
select(
-city_norm,
-city_swap,
city_clean = city_refine
) %>%
rename_all(~str_replace(., "_norm", "_clean")) %>%
rename_all(~str_remove(., "_raw")) %>%
relocate(address_clean, city_clean, state_clean, .before = zip_clean)
There are 334,607 rows of 21 columns. Each record represents a single expenditure between a committee and an outside vendor.
glimpse(nve)
#> Rows: 334,607
#> Columns: 21
#> $ expense_id <int> 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 10, 11, 12, 13, 15, 16, 16, 17, 17, 18, 19, 20, 21…
#> $ report_id <int> 6980, 6980, 6980, 6980, 6980, 6980, 6980, 6980, 6980, 6980, 6983, 6983, 6983, 6983, 6985, 6985,…
#> $ recipient_id <int> 1220, 1220, 1220, 1220, 1220, 1220, 1220, 1220, 1220, 1220, 1332, 1332, 1332, 1332, 1165, 1165,…
#> $ recipient_name <chr> "Gary Perea", "Culinary Workers Union Local 226", "Gary Perea", "Culinary Workers Union Local 2…
#> $ recipient_type <chr> "Candidate", "Independent Expenditure", "Candidate", "Independent Expenditure", "Candidate", "I…
#> $ expense_date <date> 2006-07-05, 2006-07-05, 2006-07-31, 2006-07-31, 2006-07-25, 2006-07-25, 2006-07-05, 2006-07-05…
#> $ expense_amount <dbl> 1688.12, 1688.12, 52970.78, 52970.78, 911.40, 911.40, 1764.33, 1764.33, 107.75, 107.75, 0.00, 0…
#> $ expense_type <chr> "Monetary Expense", "Monetary Expense", "Monetary Expense", "Monetary Expense", "Monetary Expen…
#> $ payee_id <int> 4, 4, 5, 5, 6, 6, 7, 7, 7, 7, 9, 9, 9, 9, 10, 11, 12, 13, 14, 16, 20, 20, 24, 24, 66, 67, 68, 6…
#> $ first_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "none",…
#> $ middle_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ last_name <chr> "The Calvert Company", "The Calvert Company", "Paid Staff", "Paid Staff", "Passkey Systems", "P…
#> $ address_1 <chr> "155 Yorba St", "155 Yorba St", "1630 S Commerce St", "1630 S Commerce St", "4395 Polaris Ave",…
#> $ address_2 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Suite D-106", NA, "Suite 4", N…
#> $ city <chr> "Tustin", "Tustin", "Las Vegas", "Las Vegas", "Las Vegas", "Las Vegas", "Las Vegas", "Las Vegas…
#> $ state <chr> "CA", "CA", "NV", "NV", "NV", "NV", "NV", "NV", "NV", "NV", NA, NA, NA, NA, "NE", "NV", "TX", "…
#> $ zip <chr> "92780", "92780", "89102", "89102", "89103", "89103", "89102", "89102", "89102", "89102", NA, N…
#> $ address_clean <chr> "155 YORBA ST", "155 YORBA ST", "1630 S COMMERCE ST", "1630 S COMMERCE ST", "4395 POLARIS AVE",…
#> $ city_clean <chr> "TUSTIN", "TUSTIN", "LAS VEGAS", "LAS VEGAS", "LAS VEGAS", "LAS VEGAS", "LAS VEGAS", "LAS VEGAS…
#> $ state_clean <chr> "CA", "CA", "NV", "NV", "NV", "NV", "NV", "NV", "NV", "NV", NA, NA, NA, NA, "NE", "NV", "TX", "…
#> $ zip_clean <chr> "92780", "92780", "89102", "89102", "89103", "89103", "89102", "89102", "89102", "89102", NA, N…
tail(nve)
#> # A tibble: 6 × 21
#> expense_id report_id recipient_id recipient_name recipient_type expense_date expense_amount expense_type payee_id
#> <int> <int> <int> <chr> <chr> <date> <dbl> <chr> <int>
#> 1 419250 119816 1210 Washoe County Rep… Political Part… 2021-07-26 432. Monetary Ex… 404571
#> 2 419251 119816 1210 Washoe County Rep… Political Part… 2021-07-26 410 Monetary Ex… 404572
#> 3 419252 119816 1210 Washoe County Rep… Political Part… 2021-09-28 400 Monetary Ex… 404572
#> 4 419253 119816 1210 Washoe County Rep… Political Part… 2021-08-19 2255. Monetary Ex… 163103
#> 5 419260 120826 3730 Battle Born Women… Political Acti… 2021-07-08 20000 Monetary Ex… 203558
#> 6 419261 120834 2289 Las Vegas PMSA Po… Political Acti… 2021-07-13 1638. Monetary Ex… 404580
#> # … with 12 more variables: first_name <chr>, middle_name <chr>, last_name <chr>, address_1 <chr>, address_2 <chr>,
#> # city <chr>, state <chr>, zip <chr>, address_clean <chr>, city_clean <chr>, state_clean <chr>, zip_clean <chr>
None of the key variables are missing values, though some of the joined payee name and geography columns are.
col_stats(nve, count_na)
#> # A tibble: 21 × 4
#> col class n p
#> <chr> <chr> <int> <dbl>
#> 1 expense_id <int> 0 0
#> 2 report_id <int> 0 0
#> 3 recipient_id <int> 0 0
#> 4 recipient_name <chr> 0 0
#> 5 recipient_type <chr> 0 0
#> 6 expense_date <date> 0 0
#> 7 expense_amount <dbl> 0 0
#> 8 expense_type <chr> 0 0
#> 9 payee_id <int> 0 0
#> 10 first_name <chr> 278335 0.832
#> 11 middle_name <chr> 328770 0.983
#> 12 last_name <chr> 16 0.0000478
#> 13 address_1 <chr> 766 0.00229
#> 14 address_2 <chr> 261618 0.782
#> 15 city <chr> 244 0.000729
#> 16 state <chr> 417 0.00125
#> 17 zip <chr> 801 0.00239
#> 18 address_clean <chr> 839 0.00251
#> 19 city_clean <chr> 912 0.00273
#> 20 state_clean <chr> 1767 0.00528
#> 21 zip_clean <chr> 1925 0.00575
We can also flag any record completely duplicated across every column.
nve <- flag_dupes(nve, -expense_id)
sum(nve$dupe_flag)
#> [1] 12269
key_vars <- c("last_name", "expense_date", "expense_amount", "recipient_name")
nve %>%
filter(dupe_flag) %>%
select(expense_id, report_id, all_of(key_vars))
#> # A tibble: 12,269 × 6
#> expense_id report_id last_name expense_date expense_amount recipient_name
#> <int> <int> <chr> <date> <dbl> <chr>
#> 1 36 6991 Bonanza 2006-08-01 731. E Tiras
#> 2 37 6991 Bonanza 2006-08-01 731. E Tiras
#> 3 56 7062 NONE 2006-01-01 0 Patricia Herzog
#> 4 56 7062 NONE 2006-01-01 0 ACME Enterprises PAC
#> 5 57 7062 NONE 2006-01-01 0 Patricia Herzog
#> 6 57 7062 NONE 2006-01-01 0 ACME Enterprises PAC
#> 7 319 7577 McLeod 2006-09-21 -315. Lynette Boggs McDonald
#> 8 320 7577 McLeod 2006-09-21 -315. Lynette Boggs McDonald
#> 9 321 7577 McLeod 2006-09-21 -341 Lynette Boggs McDonald
#> 10 322 7577 McLeod 2006-09-21 -326. Lynette Boggs McDonald
#> # … with 12,259 more rows
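Note that `campfin::flag_dupes()` marks *every* copy of a duplicated row, not just the second and later copies. A base R sketch of that logic, on an invented three-row frame:

```r
# Flag rows duplicated across every column except the unique id,
# counting both the first and subsequent copies (as flag_dupes does).
df <- data.frame(
  expense_id = 1:3,
  last_name = c("Bonanza", "Bonanza", "McLeod"),
  expense_amount = c(731, 731, -315)
)
no_id <- df[setdiff(names(df), "expense_id")]
df$dupe_flag <- duplicated(no_id) | duplicated(no_id, fromLast = TRUE)
sum(df$dupe_flag) # 2
```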
col_stats(nve, n_distinct)
#> # A tibble: 22 × 4
#> col class n p
#> <chr> <chr> <int> <dbl>
#> 1 expense_id <int> 310302 0.927
#> 2 report_id <int> 18516 0.0553
#> 3 recipient_id <int> 3173 0.00948
#> 4 recipient_name <chr> 3283 0.00981
#> 5 recipient_type <chr> 7 0.0000209
#> 6 expense_date <date> 5359 0.0160
#> 7 expense_amount <dbl> 57992 0.173
#> 8 expense_type <chr> 2 0.00000598
#> 9 payee_id <int> 61376 0.183
#> 10 first_name <chr> 3265 0.00976
#> 11 middle_name <chr> 429 0.00128
#> 12 last_name <chr> 36860 0.110
#> 13 address_1 <chr> 40668 0.122
#> 14 address_2 <chr> 4384 0.0131
#> 15 city <chr> 2787 0.00833
#> 16 state <chr> 200 0.000598
#> 17 zip <chr> 5914 0.0177
#> 18 address_clean <chr> 35563 0.106
#> 19 city_clean <chr> 1856 0.00555
#> 20 state_clean <chr> 100 0.000299
#> 21 zip_clean <chr> 4056 0.0121
#> 22 dupe_flag <lgl> 2 0.00000598
summary(nve$expense_amount)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> -60000 82 307 3276 1194 19071790
mean(nve$expense_amount <= 0)
#> [1] 0.002014303
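About 0.2% of amounts are non-positive; these are likely refunds or corrections (negative values) and zero-dollar placeholder rows, which a pair of logical counts can separate. A sketch on a small invented amount vector:

```r
# Separate negative amounts from zero-dollar rows in invented data.
amounts <- c(-315, 0, 0, 731, 2645)
mean(amounts <= 0) # 0.6 here; about 0.002 in the real data
sum(amounts < 0)   # 1 negative amount
sum(amounts == 0)  # 2 zero-dollar rows
```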
These are the records with the minimum and maximum amounts.
glimpse(nve[c(
which.max(nve$expense_amount), which.min(nve$expense_amount)
), ])
#> Rows: 2
#> Columns: 22
#> $ expense_id <int> 283456, 306225
#> $ report_id <int> 80598, 83811
#> $ recipient_id <int> 3708, 3759
#> $ recipient_name <chr> "Jeffrey Bobeck", "AFSCME Nevada"
#> $ recipient_type <chr> "Candidate", "Political Action Committee"
#> $ expense_date <date> 2018-09-21, 2018-10-16
#> $ expense_amount <dbl> 19071790, -60000
#> $ expense_type <chr> "Monetary Expense", "Monetary Expense"
#> $ payee_id <int> 268320, 297581
#> $ first_name <chr> NA, NA
#> $ middle_name <chr> NA, NA
#> $ last_name <chr> "Winner & Mandabach Campaigns", "Havas Media Group LLC"
#> $ address_1 <chr> "100 Wilshire Boulevard, Suite 2040", "1875 Connecticut Ave, NW"
#> $ address_2 <chr> NA, "10th Floor"
#> $ city <chr> "Santa Monica", "Washington"
#> $ state <chr> "CA", "DC"
#> $ zip <chr> "90401", "20009"
#> $ address_clean <chr> "100 WILSHIRE BOULEVARD SUITE 2040", "1875 CONNECTICUT AVE NW 10TH FL"
#> $ city_clean <chr> "SANTA MONICA", "WASHINGTON"
#> $ state_clean <chr> "CA", "DC"
#> $ zip_clean <chr> "90401", "20009"
#> $ dupe_flag <lgl> FALSE, FALSE
We can add the calendar year from `expense_date` with `lubridate::year()`.
nve <- mutate(nve, expense_year = year(expense_date))
min(nve$expense_date)
#> [1] "2006-01-01"
max(nve$expense_date)
#> [1] "2021-09-30"
sum(nve$expense_date > today())
#> [1] 0
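The `lubridate::year()` call simply pulls the four-digit year out of each `Date` value; base R's `format()` does the same. A sketch using the minimum and maximum dates reported above:

```r
# Extract the calendar year from Date values, as lubridate::year()
# does, using only base R.
d <- as.Date(c("2006-01-01", "2021-09-30"))
as.integer(format(d, "%Y")) # 2006 2021
```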
glimpse(sample_n(nve, 50))
#> Rows: 50
#> Columns: 23
#> $ expense_id <int> 75948, 1984, 285407, 172284, 4365, 122949, 283219, 266530, 339737, 330076, 395172, 166399, 2196…
#> $ report_id <int> 40474, 17104, 82251, 58270, 19246, 48868, 81996, 76316, 95021, 89310, 110970, 57729, 67243, 510…
#> $ recipient_id <int> 419, 3306, 6796, 5441, 1654, 3957, 529, 1137, 11180, 5828, 6967, 1900, 3618, 3546, 6796, 1166, …
#> $ recipient_name <chr> "Chris Giunchigliani", "John Breternitz", "Zachary Conine", "Stephen Silberkraus", "STAR PAC", …
#> $ recipient_type <chr> "Candidate", "Candidate", "Candidate", "Candidate", "Political Action Committee", "Candidate", …
#> $ expense_date <date> 2013-09-17, 2007-07-23, 2018-09-10, 2014-08-21, 2008-10-03, 2014-09-15, 2018-06-08, 2018-05-15…
#> $ expense_amount <dbl> 100.00, 2645.00, 38.96, 25.00, 8372.00, 2500.00, 750.00, 124.50, 500.00, 200.00, 2.30, 250.00, …
#> $ expense_type <chr> "Monetary Expense", "Monetary Expense", "Monetary Expense", "Monetary Expense", "Monetary Expen…
#> $ payee_id <int> 19377, 2306, 158492, 143814, 206, 112973, 242967, 53930, 299149, 336102, 374793, 28076, 171578,…
#> $ first_name <chr> NA, NA, NA, NA, NA, "Carissa", NA, NA, NA, "Edward", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ middle_name <chr> NA, NA, NA, NA, NA, "A", NA, NA, NA, "T.", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ last_name <chr> "Nevada Advocated for Planned Parenthood", "JEFF DOW PHOTOGRAPHY", "A&B", "Chevron", "KOLO", "B…
#> $ address_1 <chr> "550 W Plumb Lane", "8543 WHITE FIR ST", "2900 South Highland Bldg. 18, Ste B", "1420 Paseo Ver…
#> $ address_2 <chr> "Suite #B104", "SUITE D-3", NA, NA, NA, NA, NA, NA, "#51", NA, "Suite 1770", NA, NA, NA, NA, NA…
#> $ city <chr> "Reno", "RENO", "Las Vegas", "Henderson", "RENO", "BOULDER CITY", "West Wendover", "RENO", "Las…
#> $ state <chr> "NV", "NV", "NV", "NV", "NV", "NV", "NV", "NV", "NV", "NV", "LA", "NV", "VA", "WA", "DC", "NV",…
#> $ zip <chr> "89509", "89523", "89109", "89012", "89510", "89005", "89883", "89512", "89139", "89015", "7011…
#> $ address_clean <chr> "550 W PLUMB LN SUITE #B104", "8543 WHITE FIR ST SUITE D3", "2900 SOUTH HIGHLAND BLDG 18 STE B"…
#> $ city_clean <chr> "RENO", "RENO", "LAS VEGAS", "HENDERSON", "RENO", "BOULDER CITY", "WEST WENDOVER", "RENO", "LAS…
#> $ state_clean <chr> "NV", "NV", "NV", "NV", "NV", "NV", "NV", "NV", "NV", "NV", "LA", "NV", "VA", "WA", "DC", "NV",…
#> $ zip_clean <chr> "89509", "89523", "89109", "89012", "89510", "89005", "89883", "89512", "89139", "89015", "7011…
#> $ dupe_flag <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE,…
#> $ expense_year <dbl> 2013, 2007, 2018, 2014, 2008, 2014, 2018, 2018, 2019, 2019, 2020, 2015, 2016, 2014, 2019, 2017,…
- There are 334,607 records in the database.
- There are 12,269 duplicate records in the database.
- The range and distribution of `amount` and `date` seem reasonable.
- There are 0 records missing key variables.
- Consistency in geographic data has been improved with `campfin::normal_*()`.
- The 4-digit `year` variable has been created with `lubridate::year()`.
Now the file can be saved on disk for upload to the Accountability server. We will name the object using a date range of the records included.
min_dt <- str_remove_all(min(nve$expense_date), "-")
max_dt <- str_remove_all(max(nve$expense_date), "-")
csv_ts <- paste(min_dt, max_dt, sep = "-")
clean_dir <- dir_create(here("nv", "expends", "data", "clean"))
clean_csv <- path(clean_dir, glue("nv_expends_{csv_ts}.csv"))
clean_rds <- path_ext_set(clean_csv, "rds")
write_csv(nve, clean_csv, na = "")
write_rds(nve, clean_rds, compress = "xz")
(clean_size <- file_size(clean_csv))
#> 64.8M
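The date-range stamp in the file name is just the minimum and maximum `expense_date` with their hyphens stripped. Equivalently in base R, without `stringr`, using the dates reported above:

```r
# Build the "YYYYMMDD-YYYYMMDD" stamp used in the clean file name.
min_dt <- gsub("-", "", "2006-01-01")
max_dt <- gsub("-", "", "2021-09-30")
csv_ts <- paste(min_dt, max_dt, sep = "-")
csv_ts # "20060101-20210930"
```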
We can use `aws.s3::put_object()` to upload the text file to the IRW server.
aws_key <- path("csv", basename(clean_csv))
if (!object_exists(aws_key, "publicaccountability")) {
put_object(
file = clean_csv,
object = aws_key,
bucket = "publicaccountability",
acl = "public-read",
show_progress = TRUE,
multipart = TRUE
)
}
aws_head <- head_object(aws_key, "publicaccountability")
(aws_size <- as_fs_bytes(attr(aws_head, "content-length")))
unname(aws_size == clean_size)