Kiernan Nicholls Sun Mar 26 15:37:18 2023
The Accountability Project is an effort to cut across data silos and give journalists, policy professionals, activists, and the public at large a simple way to search across huge volumes of public data about people and organizations.
Our goal is to standardize public data on a few key fields by thinking of each dataset row as a transaction. For each transaction there should be (at least) 3 variables:
- All parties to a transaction.
- The date of the transaction.
- The amount of money involved.
This document describes the process used to complete the following objectives:
- How many records are in the database?
- Check for entirely duplicated records.
- Check ranges of continuous variables.
- Is there anything blank or missing?
- Check for consistency issues.
- Create a five-digit ZIP Code called
zip
. - Create a
year
field from the transaction date. - Make sure there is data on both parties to a transaction.
The following packages are needed to collect, manipulate, visualize,
analyze, and communicate these results. The pacman
package will
facilitate their installation and attachment.
if (!require("pacman")) {
install.packages("pacman")
}
pacman::p_load(
tidyverse, # data manipulation
lubridate, # datetime strings
jsonlite, # parse json data
gluedown, # printing markdown
janitor, # clean data frames
campfin, # custom irw tools
aws.s3, # aws cloud storage
refinr, # cluster & merge
scales, # format strings
knitr, # knit documents
vroom, # fast reading
rvest, # scrape html
glue, # code strings
here, # project paths
httr, # http requests
fs # local storage
)
This document should be run as part of the R_campfin
project, which
lives as a sub-directory of the more general, language-agnostic
irworkshop/accountability_datacleaning
GitHub repository.
The R_campfin
project uses the RStudio
projects
feature and should be run as such. The project also uses the dynamic
here::here()
tool for file paths relative to your machine.
# where does this document knit?
here::i_am("wa/contribs/docs/wa_contribs_diary.Rmd")
wa_meta <- fromJSON("https://data.wa.gov/api/views/kv7h-kjye")
Contributions for the state of Washington can be obtained from the state’s Public Disclosure Commission on their Open Data portal. The data is titled “Contributions to Candidates and Political Committees” and was created on December 16, 2016 and was last updated at March 26, 2023.
This dataset contains cash and in-kind contributions, (including unpaid loans) made to Washington State Candidates and Political Committees for the last 10 years as reported to the PDC on forms C3, C4, Schedule C and their electronic filing equivalents. It does not include loans which have been paid or forgiven, pledges or any expenditures.
For candidates, the number of years is determined by the year of the election, not necessarily the year the contribution was reported. For political committees, the number of years is determined by the calendar year of the reporting period.
Candidates and political committees choosing to file under “mini reporting” are not included in this dataset. See WAC 390-16-105 for information regarding eligibility.
This dataset is a best-effort by the PDC to provide a complete set of records as described herewith and may contain incomplete or incorrect information. The PDC provides access to the original reports for the purpose of record verification.
Descriptions attached to this dataset do not constitute legal definitions; please consult RCW 42.17A and WAC Title 390 for legal definitions and additional information political finance disclosure requirements.
CONDITION OF RELEASE: This publication constitutes a list of individuals prepared by the Washington State Public Disclosure Commission and may not be used for commercial purposes. This list is provided on the condition and with the understanding that the persons receiving it agree to this statutorily imposed limitation on its use. See RCW 42.56.070(9) and AGO 1975 No. 15.
fieldName | dataTypeName | description |
---|---|---|
id |
text | PDC internal identifier that corresponds to a single contribution or correction record. When combined with the origin value, this number uniquely identifies a single row. |
report_number |
text | PDC identifier used for tracking the individual form C3, C4 or C4 schedule C. Multiple contributions or corrections will correspond to the same report number when they were reported to the PDC at the same time. The report number is unique to the report it represents. When a report is amended, a new report number is assigned that supersedes the original version of the amended report and the original report records are not included in this dataset. |
origin |
text | The form, schedule or section where the record was reported. Please see https://www.pdc.wa.gov/learn/forms for a list of forms and instructions. AUB (Form C3, schedule AU, auction buyer); AUD (Form C3, schedule AU, auction donor); C.1 (Form C4, schedule C1 correction); C3 (Form C3 cash contribution); C3.1A (Form C3, anonymous contributions); C3.1B (Form C3, candidate personal funds); C3.1D (Form C3, miscellaneous receipts) C3.1E (Form C3, small contributions); C4 (Form C4, in-kind contribution); |
committee_id |
text | The unique identifier of a committee. For a continuing committee, this id will be the same for all the years that the committee is registered. Single year committees and candidate committees will have a unique id for each year even though the candidate or committee organization might be the same across years. Surplus accounts will have a single committee id across all years. |
filer_id |
text | The unique id assigned to a candidate or political committee. The filer id is consistent across election years with the exception that an individual running for a second office in the same election year will receive a second filer id. There is no correlation between the two filer ids. For a candidate and single-election-year committee such as a ballot committee, the combination of filer_id and election_year uniquely identifies a campaign. |
type |
text | Indicates if this record is for a candidate or a political committee. In the case of a political committee, it may be either a continuing political committee, party committee or single election year committee. |
filer_name |
text | The candidate or committee name as reported on the form C1 candidate or committee registration form. The name will be consistent across all records for the same filer id and election year but may differ across years due to candidates or committees changing their name. |
office |
text | The office sought by the candidate. Does not apply to political committees. |
legislative_district |
text | The Washington State legislative district. This field only applies to candidates where the office is “state senator” or “state representative.” |
position |
text | The position associated with an office. This field typically applies to judicial and local office that have multiple positions or seats. This field does not apply to political committees. |
party |
text | The political party as declared by the candidate or committee on their form C1 registration. Contains only “Major parties” as recognized by Washington State law. |
ballot_number |
text | If the committee is a Statewide Ballot Initiative Committee a ballot number will appear once a ballot number is assigned by the Secretary of State. Local Ballot Initiatives will not have a ballot number. This field will contain a number only if the Secretary of State issues a number. |
for_or_against |
text | Ballot initiative committees are formed to either support or oppose an initiative. This field represents whether a committee “supports” (for) or “opposes” (against) a ballot initiative. |
jurisdiction |
text | The political jurisdiction associated with the office of a candidate. |
jurisdiction_county |
text | The county associated with the jurisdiction of a candidate. Multi-county jurisdictions as reported as the primary county. This field will be empty for political committees and when a candidate jurisdiction is statewide. |
jurisdiction_type |
text | The type of jurisdiction this office is: Statewide, Local, etc. |
election_year |
number | The election year in the case of candidates and single election committees. The reporting year in the case of continuing political committees. |
amount |
number | The amount of the cash or in-kind contribution. On corrections records, this field is the amount of the adjustment. |
cash_or_in_kind |
text | What kind of contribution this is, if known. |
receipt_date |
calendar_date | The date that the contribution was received. |
description |
text | The reported description of the transaction. This field does not apply to cash contributions. Not all in-kind contributions and corrections will contain a description. In the case of corrections, the PDC has added a notation regarding the amounts reported on the form Schedule C. A C3 contains not only detailed contributions it also contains none detailed cash - this column will contain the descriptions: ‘Anonymous - Cash’; ‘Candidates Personal Funds - Does not include candidate loans’; ‘Miscellaneous Receipts’, ‘Contributions of $25 or less contributed by: ____ persons’. |
memo |
text | The optional memo field associated with the transaction. In most cases this field will be blank. |
primary_general |
text | Candidates subject to contribution limits must specify whether a contribution is designated for the primary or the general election. Contributions to candidates not subject to limits, political committees and continuing political committees apply to the full election cycle. |
code |
text | When a contribution is received, the code field denotes the type of entity that made the contribution. These types are determined by the filer. The field values correspond to the selected code: A: Anonymous; B: Business; C: PAC; I: Individual; L: Caucus; O: Other; P: Bona Fide Party; S: Self Pers. Fund; T: Minor Party; U: Union. |
contributor_category |
text | Indicates if the contributor is an “Individual” or “Organization”. When a contribution is received, the code field denotes the type of entity that made the contribution. These types are determined by the filer. The field values correspond to the selected code: A: Anonymous; B: Business; C: PAC; I: Individual; L: Caucus; O: Other; P: Bona Fide Party; S: Self Pers. Fund; T: Minor Party; U: Union. Codes B, C, F, L, O, P, M and U are assigned the category “Organization” and all others are assigned the category “Individual”. |
contributor_name |
text | The name of the individual or organization making the contribution as reported. The names appearing here have not been normalized and the same entity may be represented by different names in the dataset. This field only applies to contributions where the aggregate total of all contributions from this contributor to this candidate or committee exceeds $25 for the election cycle (calendar year for continuing committees). |
contributor_address |
text | The street address of the individual or organization making the contribution. Refer to the contributor name field for more information on when this information is required. |
contributor_city |
text | The city of the individual or organization making the contribution. Refer to the contributor name field for more information on when this information is required. |
contributor_state |
text | The state of the individual or organization making the contribution. Contributions from outside the United States may contain foreign postal region codes in this field. Refer to the contributor name field for more information on when this information is required. |
contributor_zip |
text | The US zip code of the individual or organization making the contribution. Contributions from outside the United States may contain foreign postal codes in this field. Refer to the contributor name field for more information on when this information is required. |
contributor_occupation |
text | The occupation of the contributor. This field only applies to contributions by individuals and only when an individual gives a campaign or committee more than $100 in the aggregate for the election cycle (calendar year for continuing political committees). |
contributor_employer_name |
text | The name of the contributor’s employer. The names appearing here have not been normalized and the same entity may be represented by different names in the dataset. Refer to the contributor occupation field to see when this field applies. |
contributor_employer_city |
text | City of the contributor’s employer. Refer to the contributor occupation field to see when this field applies. |
contributor_employer_state |
text | State of the contributor’s employer. Refer to the contributor occupation field to see when this field applies. |
url |
url | A link to a PDF version of the original report as it was filed to the PDC. |
contributor_location |
point | The geocoded location of the contributor as reported. The quality of the geocoded location is dependent on how many of the address fields are available and is calculated using a third-party service. The PDC has not verified the results of the geocoding. Please refer to the recipient_name field for more information regarding address fields. |
raw_dir <- dir_create(here("wa", "contribs", "data", "raw"))
raw_tsv <- path(raw_dir, path_ext_set(wa_meta$resourceName, "tsv"))
if (!file_exists(raw_tsv)) {
wa_head <- GET(
url = "https://data.wa.gov/api/views/kv7h-kjye/rows.tsv",
query = list(accessType = "DOWNLOAD"),
write_disk(path = raw_tsv),
progress("down")
)
}
wac <- read_delim(
file = raw_tsv,
delim = "\t",
escape_backslash = FALSE,
escape_double = FALSE,
col_types = cols(
.default = col_character(),
id = col_integer(),
election_year = col_integer(),
amount = col_double(),
receipt_date = col_date("%m/%d/%Y")
)
)
There are 5,952,878 rows of 36 columns. Each record represents a single cash or in-kind contribution or correction.
glimpse(wac)
#> Rows: 5,952,878
#> Columns: 36
#> $ id <int> 16417211, 16417210, 16417209, 16417208, 16417207, 16417226, 16417240, 16417279, 164…
#> $ report_number <chr> "110108190", "110108190", "110108190", "110108190", "110108190", "110108192", "1101…
#> $ origin <chr> "C3", "C3", "C3", "C3", "C3", "C3", "C3", "C3", "C3", "C3", "C3", "C3", "C3", "C3",…
#> $ committee_id <chr> "30904", "30904", "30904", "30904", "30904", "29766", "30614", "15022", "30614", "3…
#> $ filer_id <chr> "BHAGH 570", "BHAGH 570", "BHAGH 570", "BHAGH 570", "BHAGH 570", "HALLM 516",…
#> $ type <chr> "Candidate", "Candidate", "Candidate", "Candidate", "Candidate", "Candidate", "Cand…
#> $ filer_name <chr> "Harry O. Bhagwandin", "Harry O. Bhagwandin", "Harry O. Bhagwandin", "Harry O. Bhag…
#> $ office <chr> "COUNTY COMMISSIONER", "COUNTY COMMISSIONER", "COUNTY COMMISSIONER", "COUNTY COMMIS…
#> $ legislative_district <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "26", NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ position <chr> "County Commissioner Position 3", "County Commissioner Position 3", "County Commiss…
#> $ party <chr> "REPUBLICAN", "REPUBLICAN", "REPUBLICAN", "REPUBLICAN", "REPUBLICAN", "DEMOCRATIC",…
#> $ ballot_number <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ for_or_against <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ jurisdiction <chr> "LEWIS CO", "LEWIS CO", "LEWIS CO", "LEWIS CO", "LEWIS CO", "THURSTON CO", "CHELAN …
#> $ jurisdiction_county <chr> "LEWIS", "LEWIS", "LEWIS", "LEWIS", "LEWIS", "THURSTON", "CHELAN", NA, "CHELAN", "C…
#> $ jurisdiction_type <chr> "Local", "Local", "Local", "Local", "Local", "Local", "Local", NA, "Local", "Local"…
#> $ election_year <int> 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2010, 2022, 2022,…
#> $ amount <dbl> 500.0, 500.0, 200.0, 1000.0, 150.0, 10.0, 50.0, 525.0, 250.0, 100.0, 400.0, 50.0, 2…
#> $ cash_or_in_kind <chr> "Cash", "Cash", "Cash", "Cash", "Cash", "Cash", "Cash", "Cash", "Cash", "Cash", "Ca…
#> $ receipt_date <date> 2022-08-10, 2022-08-10, 2022-08-10, 2022-08-09, 2022-08-09, 2022-08-19, 2022-08-15…
#> $ description <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Contributions of $25 or less contr…
#> $ memo <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ primary_general <chr> "General", "General", "General", "General", "General", "General", "General", "Full …
#> $ code <chr> "Individual", "Individual", "Individual", "Individual", "Individual", "Individual",…
#> $ contributor_category <chr> "Individual", "Individual", "Individual", "Individual", "Individual", "Individual",…
#> $ contributor_name <chr> "M REMUND SUSAN", "J REMUND RENE", "RUSSELL ROBERT", "M FENN TERESA", "D KRABBE PET…
#> $ contributor_address <chr> "213 Boistfort Road", "213 Boistfort Road", "PO Box 902", "126 Hideaway Lane", "PO …
#> $ contributor_city <chr> "Chehalis", "Chehalis", "Chehalis", "Packwood", "Randle", "LACEY", "Leavenworth", "…
#> $ contributor_state <chr> "WA", "WA", "WA", "WA", "WA", "WA", "WA", "WA", "WA", "WA", "WA", "WA", NA, "WA", "…
#> $ contributor_zip <chr> "98532", "98532", "98532", "98361", "98377", "98516", "98826", "98516", "98826", "9…
#> $ contributor_occupation <chr> "RETIRED", "RETIRED", "RETIRED", "RETIRED", "RETIRED", NA, NA, NA, "RETIRED", NA, "…
#> $ contributor_employer_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, "N/A", NA, "Snowgrass Lodge", "RETIRED", NA, NA, NA…
#> $ contributor_employer_city <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Leavenworth", "VAUGHN", NA, NA, NA, NA, NA…
#> $ contributor_employer_state <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "WA", "WA", NA, NA, NA, NA, NA, NA, "WA", N…
#> $ url <chr> "https://apollo.pdc.wa.gov/public/registrations/campaign-finance-report/110108190",…
#> $ contributor_location <chr> NA, NA, "POINT (-122.96691 46.66112)", NA, NA, "POINT (-122.76746 47.10356)", NA, "…
tail(wac)
#> # A tibble: 6 × 36
#> id report_number origin committee_id filer_id type filer_name office legislative_district position party
#> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 17145908 110139988 C3 31451 SIMMJ--210 Candidate Justin Fl… CITY … <NA> CITY CO… <NA>
#> 2 17145907 110139988 C3 31451 SIMMJ--210 Candidate Justin Fl… CITY … <NA> CITY CO… <NA>
#> 3 17145906 110139988 C3 31451 SIMMJ--210 Candidate Justin Fl… CITY … <NA> CITY CO… <NA>
#> 4 17145905 110139988 C3 31451 SIMMJ--210 Candidate Justin Fl… CITY … <NA> CITY CO… <NA>
#> 5 17145904 110139988 C3 31451 SIMMJ--210 Candidate Justin Fl… CITY … <NA> CITY CO… <NA>
#> 6 17145922 110139989 C3 31906 FIREC--701 Political… Committee… <NA> <NA> <NA> <NA>
#> # ℹ 25 more variables: ballot_number <chr>, for_or_against <chr>, jurisdiction <chr>, jurisdiction_county <chr>,
#> # jurisdiction_type <chr>, election_year <int>, amount <dbl>, cash_or_in_kind <chr>, receipt_date <date>,
#> # description <chr>, memo <chr>, primary_general <chr>, code <chr>, contributor_category <chr>,
#> # contributor_name <chr>, contributor_address <chr>, contributor_city <chr>, contributor_state <chr>,
#> # contributor_zip <chr>, contributor_occupation <chr>, contributor_employer_name <chr>,
#> # contributor_employer_city <chr>, contributor_employer_state <chr>, url <chr>, contributor_location <chr>
Columns vary in their degree of missing values.
col_stats(wac, count_na)
#> # A tibble: 36 × 4
#> col class n p
#> <chr> <chr> <int> <dbl>
#> 1 id <int> 0 0
#> 2 report_number <chr> 0 0
#> 3 origin <chr> 0 0
#> 4 committee_id <chr> 0 0
#> 5 filer_id <chr> 0 0
#> 6 type <chr> 0 0
#> 7 filer_name <chr> 0 0
#> 8 office <chr> 3658261 0.615
#> 9 legislative_district <chr> 5248667 0.882
#> 10 position <chr> 5576880 0.937
#> 11 party <chr> 3412007 0.573
#> 12 ballot_number <chr> 5821716 0.978
#> 13 for_or_against <chr> 5610416 0.942
#> 14 jurisdiction <chr> 3458923 0.581
#> 15 jurisdiction_county <chr> 4216294 0.708
#> 16 jurisdiction_type <chr> 3548596 0.596
#> 17 election_year <int> 0 0
#> 18 amount <dbl> 0 0
#> 19 cash_or_in_kind <chr> 0 0
#> 20 receipt_date <date> 2018 0.000339
#> 21 description <chr> 5622182 0.944
#> 22 memo <chr> 5888660 0.989
#> 23 primary_general <chr> 282651 0.0475
#> 24 code <chr> 0 0
#> 25 contributor_category <chr> 0 0
#> 26 contributor_name <chr> 0 0
#> 27 contributor_address <chr> 216802 0.0364
#> 28 contributor_city <chr> 213707 0.0359
#> 29 contributor_state <chr> 206114 0.0346
#> 30 contributor_zip <chr> 219579 0.0369
#> 31 contributor_occupation <chr> 4028855 0.677
#> 32 contributor_employer_name <chr> 4262670 0.716
#> 33 contributor_employer_city <chr> 4391052 0.738
#> 34 contributor_employer_state <chr> 4391679 0.738
#> 35 url <chr> 0 0
#> 36 contributor_location <chr> 3757707 0.631
We can flag any record missing a key variable needed to identify a transaction.
key_vars <- c("receipt_date", "contributor_name", "amount", "filer_name")
wac <- flag_na(wac, all_of(key_vars))
sum(wac$na_flag)
#> [1] 2018
wac %>%
filter(na_flag) %>%
select(all_of(key_vars))
#> # A tibble: 2,018 × 4
#> receipt_date contributor_name amount filer_name
#> <date> <chr> <dbl> <chr>
#> 1 NA SMALL CONTRIBUTIONS 25 18TH LEG DIST DEMO CENT COMM NON EXEMPT
#> 2 NA SMALL CONTRIBUTIONS 40.2 WASHINGTON EDUCATION ASSOCIATION POLITICAL ACTION COMMITTEE
#> 3 NA SMALL CONTRIBUTIONS 78.7 ALICIA J RULE (Alicia Rule)
#> 4 NA SMALL CONTRIBUTIONS 31.3 Matthew T. Macklin (Matthew Macklin)
#> 5 NA SMALL CONTRIBUTIONS 24.3 Sandra Kaiser
#> 6 NA SMALL CONTRIBUTIONS 65 Sue W. Forde (Sue Forde)
#> 7 NA SMALL CONTRIBUTIONS 230. 48TH DISTRICT DEMOCRATIC ORGANIZATION
#> 8 NA SMALL CONTRIBUTIONS 70 Steven L. Saunders (Steven L Saunders)
#> 9 NA SMALL CONTRIBUTIONS 135 Brian L Pruiett (Brian Pruiett)
#> 10 NA SMALL CONTRIBUTIONS 149 Brian James King (Brian King)
#> # ℹ 2,008 more rows
We can also flag any entirely duplicate rows. To keep memory usage low
with such a large data frame, we will split our data into a list and
check each element of the list. For each chunk, we will write the
duplicate id
to a text file.
prop_distinct(wac$id)
#> [1] 1
dupe_file <- here("il", "contribs", "data", "dupe_ids.txt")
if (!file_exists(dupe_file)) {
tmp <- file_temp(ext = "rds")
write_rds(wac, file = tmp)
file_size(tmp)
wa_id <- split(wac$id, wac$receipt_date)
was <- wac %>%
select(-id) %>%
group_split(receipt_date)
if (file_exists(tmp)) {
rm(wac)
Sys.sleep(5)
flush_memory(2)
}
pb <- txtProgressBar(max = length(was), style = 3)
for (i in seq_along(was)) {
if (nrow(was[[i]]) < 2) {
next
}
d1 <- duplicated(was[[i]], fromLast = FALSE)
d2 <- duplicated(was[[i]], fromLast = TRUE)
dupe_vec <- d1 | d2
rm(d1, d2)
if (any(dupe_vec)) {
write_lines(
x = wa_id[[i]][dupe_vec],
file = dupe_file,
append = file_exists(dupe_file),
na = ""
)
}
rm(dupe_vec)
was[[i]] <- NA
wa_id[[i]] <- NA
if (i %% 100 == 0) {
Sys.sleep(2)
flush_memory()
}
setTxtProgressBar(pb, i)
}
rm(was, wa_id)
Sys.sleep(5)
flush_memory()
wac <- read_rds(tmp)
}
dupe_id <- tibble(
id = as.integer(read_lines(dupe_file, skip_empty_rows = TRUE)),
dupe_flag = TRUE
)
wac <- left_join(wac, dupe_id, by = "id")
wac <- mutate(wac, across(dupe_flag, Negate(is.na)))
0.1% of rows are duplicates.
wac %>%
filter(dupe_flag) %>%
count(receipt_date, contributor_name, amount, filer_name, sort = TRUE)
#> # A tibble: 6,952 × 5
#> receipt_date contributor_name amount filer_name n
#> <date> <chr> <dbl> <chr> <int>
#> 1 2017-11-21 KALISPEL TRIBE OF INDIANS 1000 Mark Schoesler (MARK SCHOESLER) 2
#> 2 2019-12-16 SCOTT PETER 15 CITIZENS FOR EVERETT PUBLIC SCHOOLS 2
#> 3 2020-04-30 ACLU OF WASHINGTON 1000 Commit to Change WA (sponsored by ACLU of Washington) 2
#> 4 2020-06-19 SARGENT EUGENE 6.25 BRIAN SMILEY 2
#> 5 2020-07-28 ELI LILLY AND COMPANY 500 TAKKO DEAN A 2
#> 6 2020-09-20 VOSS EMILY 75 Loren D. Culp (Loren Culp) 2
#> 7 2020-09-22 EVANS JULIE 100 REPUBLICAN STATE LEADERSHIP COMMITTEE - WASHINGTON PAC 2
#> 8 2020-09-22 FOXWINCHELL CLAUDIA 25 REPUBLICAN STATE LEADERSHIP COMMITTEE - WASHINGTON PAC 2
#> 9 2020-09-29 MURRAY EVAN 15 REPUBLICAN STATE LEADERSHIP COMMITTEE - WASHINGTON PAC 2
#> 10 2020-10-23 HO ANTHONY 15 Gael D Tarleton (Gael Tarleton) 2
#> # ℹ 6,942 more rows
col_stats(wac, n_distinct)
#> # A tibble: 38 × 4
#> col class n p
#> <chr> <chr> <int> <dbl>
#> 1 id <int> 5952878 1
#> 2 report_number <chr> 557793 0.0937
#> 3 origin <chr> 5 0.000000840
#> 4 committee_id <chr> 11258 0.00189
#> 5 filer_id <chr> 7719 0.00130
#> 6 type <chr> 2 0.000000336
#> 7 filer_name <chr> 9102 0.00153
#> 8 office <chr> 44 0.00000739
#> 9 legislative_district <chr> 50 0.00000840
#> 10 position <chr> 219 0.0000368
#> 11 party <chr> 12 0.00000202
#> 12 ballot_number <chr> 132 0.0000222
#> 13 for_or_against <chr> 3 0.000000504
#> 14 jurisdiction <chr> 588 0.0000988
#> 15 jurisdiction_county <chr> 39 0.00000655
#> 16 jurisdiction_type <chr> 5 0.000000840
#> 17 election_year <int> 20 0.00000336
#> 18 amount <dbl> 54142 0.00910
#> 19 cash_or_in_kind <chr> 2 0.000000336
#> 20 receipt_date <date> 6720 0.00113
#> 21 description <chr> 71678 0.0120
#> 22 memo <chr> 3 0.000000504
#> 23 primary_general <chr> 4 0.000000672
#> 24 code <chr> 8 0.00000134
#> 25 contributor_category <chr> 2 0.000000336
#> 26 contributor_name <chr> 1133067 0.190
#> 27 contributor_address <chr> 1160748 0.195
#> 28 contributor_city <chr> 21307 0.00358
#> 29 contributor_state <chr> 200 0.0000336
#> 30 contributor_zip <chr> 20178 0.00339
#> 31 contributor_occupation <chr> 49736 0.00835
#> 32 contributor_employer_name <chr> 155183 0.0261
#> 33 contributor_employer_city <chr> 10929 0.00184
#> 34 contributor_employer_state <chr> 158 0.0000265
#> 35 url <chr> 557793 0.0937
#> 36 contributor_location <chr> 214078 0.0360
#> 37 na_flag <lgl> 2 0.000000336
#> 38 dupe_flag <lgl> 2 0.000000336
summary(wac$amount)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> -2500000 9 35 287 100 8929810
mean(wac$amount <= 0)
#> [1] 0.003589692
These are the records with the minimum and maximum amounts.
glimpse(wac[c(which.max(wac$amount), which.min(wac$amount)), ])
#> Rows: 2
#> Columns: 38
#> $ id <int> 10644750, 10645096
#> $ report_number <chr> "100433945", "100441590"
#> $ origin <chr> "C3", "C.1"
#> $ committee_id <chr> "22059", "22059"
#> $ filer_id <chr> "YES1183109", "YES1183109"
#> $ type <chr> "Political Committee", "Political Committee"
#> $ filer_name <chr> "YES ON 1183 COALITION", "YES ON 1183 COALITION"
#> $ office <chr> NA, NA
#> $ legislative_district <chr> NA, NA
#> $ position <chr> NA, NA
#> $ party <chr> NA, NA
#> $ ballot_number <chr> NA, NA
#> $ for_or_against <chr> NA, NA
#> $ jurisdiction <chr> NA, NA
#> $ jurisdiction_county <chr> NA, NA
#> $ jurisdiction_type <chr> NA, NA
#> $ election_year <int> 2011, 2011
#> $ amount <dbl> 8929810, -2500000
#> $ cash_or_in_kind <chr> "Cash", "Cash"
#> $ receipt_date <date> 2011-10-17, 2011-10-17
#> $ description <chr> NA, "CORRECTION TO CONTRIBUTIONS (Reported amount: 8929810.00; Corrected amount: 64…
#> $ memo <chr> NA, NA
#> $ primary_general <chr> "Full Election", NA
#> $ code <chr> "Business", "Other"
#> $ contributor_category <chr> "Organization", "Individual"
#> $ contributor_name <chr> "COSTCO", "COSTCO"
#> $ contributor_address <chr> "999 LAKE DR", NA
#> $ contributor_city <chr> "ISSAQUAH", NA
#> $ contributor_state <chr> "WA", NA
#> $ contributor_zip <chr> "98027", NA
#> $ contributor_occupation <chr> NA, NA
#> $ contributor_employer_name <chr> NA, NA
#> $ contributor_employer_city <chr> NA, NA
#> $ contributor_employer_state <chr> NA, NA
#> $ url <chr> "https://web.pdc.wa.gov/rptimg/default.aspx?repno=100433945", "https://web.pdc.wa.g…
#> $ contributor_location <chr> "POINT (-122.05221 47.5493)", NA
#> $ na_flag <lgl> FALSE, FALSE
#> $ dupe_flag <lgl> FALSE, FALSE
We can add the calendar year from receipt_date
with
lubridate::year()
wac <- mutate(wac, receipt_year = year(receipt_date))
min(wac$receipt_date)
#> [1] NA
sum(wac$receipt_year < 2000)
#> [1] NA
max(wac$receipt_date)
#> [1] NA
sum(wac$receipt_date > today())
#> [1] NA
To improve the searchability of the database, we will perform some
consistent, confident string normalization. For geographic variables
like city names and ZIP codes, the corresponding campfin::normal_*()
functions are tailor made to facilitate this process.
The contributor_address
variable is already sufficiently normalized.
sample(wac$contributor_address, 5)
#> [1] "4304 OREGON DR" "3200 62ND AVE SW" "5407 N SULLIVAN RD" "1401 S SPRAGUE AVE" "2314 N HARMON"
For ZIP codes, the campfin::normal_zip()
function will attempt to
create valid five digit codes by removing the ZIP+4 suffix and
returning leading zeroes dropped by other programs like Microsoft Excel.
wac <- wac %>%
mutate(
zip_norm = normal_zip(
zip = contributor_zip %>%
str_remove_all("-\\d{4}$"),
na_rep = TRUE
)
)
wac %>%
count(contributor_zip, zip_norm, sort = TRUE) %>%
filter(contributor_zip != zip_norm)
#> # A tibble: 427 × 3
#> contributor_zip zip_norm n
#> <chr> <chr> <int>
#> 1 V6J 1 61 83
#> 2 -9-81 981 55
#> 3 V6Z 2 62 19
#> 4 100.0 1000 18
#> 5 98--- 98 11
#> 6 9834- 9834 11
#> 7 +9116 9116 10
#> 8 WA989 989 9
#> 9 500.0 5000 8
#> 10 98,16 9816 8
#> # ℹ 417 more rows
progress_table(
wac$contributor_zip,
wac$zip_norm,
compare = valid_zip
)
#> # A tibble: 2 × 6
#> stage prop_in n_distinct prop_na n_out n_diff
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 wac$contributor_zip 0.995 20178 0.0369 30912 3237
#> 2 wac$zip_norm 0.998 19756 0.0397 13904 2815
Valid two digit state abbreviations can be made using the
campfin::normal_state()
function.
wac <- wac %>%
mutate(
state_norm = normal_state(
state = contributor_state,
abbreviate = TRUE,
na_rep = TRUE,
valid = valid_state
)
)
wac %>%
count(contributor_state, state_norm, sort = TRUE) %>%
filter(contributor_state != state_norm)
#> # A tibble: 40 × 3
#> contributor_state state_norm n
#> <chr> <chr> <int>
#> 1 Washington WA 159
#> 2 wa WA 117
#> 3 Wa WA 115
#> 4 or OR 31
#> 5 WASHINGTON WA 19
#> 6 ca CA 8
#> 7 wA WA 5
#> 8 Or OR 4
#> 9 Ca CA 3
#> 10 Tx TX 3
#> # ℹ 30 more rows
progress_table(
wac$contributor_state,
wac$state_norm,
compare = valid_state
)
#> # A tibble: 2 × 6
#> stage prop_in n_distinct prop_na n_out n_diff
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 wac$contributor_state 1.00 200 0.0346 1493 141
#> 2 wac$state_norm 1 60 0.0348 0 1
Cities are the most difficult geographic variable to normalize, simply due to the wide variety of valid cities and formats.
The campfin::normal_city()
function is a good start, again converting
case, removing punctuation, but expanding USPS abbreviations. We can
also remove invalid_city
values.
norm_city <- wac %>%
distinct(contributor_city, state_norm, zip_norm) %>%
mutate(
city_norm = normal_city(
city = contributor_city,
abbs = usps_city,
states = c("WA", "DC", "WASHINGTON"),
na = invalid_city,
na_rep = TRUE
)
)
We can further improve normalization by comparing our normalized value against the expected value for that record’s state abbreviation and ZIP code. If the normalized value is either an abbreviation for or very similar to the expected value, we can confidently swap those two.
norm_city <- norm_city %>%
left_join(
y = zipcodes,
by = c(
"state_norm" = "state",
"zip_norm" = "zip"
)
) %>%
rename(city_match = city) %>%
mutate(
match_abb = is_abbrev(city_norm, city_match),
match_dist = str_dist(city_norm, city_match),
city_swap = if_else(
condition = !is.na(match_dist) & (match_abb | match_dist == 1),
true = city_match,
false = city_norm
)
) %>%
select(
-city_match,
-match_dist,
-match_abb
)
wac <- left_join(
x = wac,
y = norm_city,
by = c(
"contributor_city",
"state_norm",
"zip_norm"
)
)
The OpenRefine algorithms can be used to group similar strings and replace the less common versions with their most common counterpart. This can greatly reduce inconsistency, but with low confidence; we will only keep any refined strings that have a valid city/state/zip combination.
good_refine <- wac %>%
mutate(
city_refine = city_swap %>%
key_collision_merge() %>%
n_gram_merge(numgram = 1)
) %>%
filter(city_refine != city_swap) %>%
inner_join(
y = zipcodes,
by = c(
"city_refine" = "city",
"state_norm" = "state",
"zip_norm" = "zip"
)
)
#> # A tibble: 298 × 5
#> state_norm zip_norm city_swap city_refine n
#> <chr> <chr> <chr> <chr> <int>
#> 1 WA 98374 PUAYLLP PUYALLUP 74
#> 2 ID 83815 COUER DALENE COEUR D ALENE 56
#> 3 WA 98537 COSMPOLOLIS COSMOPOLIS 35
#> 4 WA 98284 SEDROWOOLLLEY SEDRO WOOLLEY 33
#> 5 ID 83816 COUER DALENE COEUR D ALENE 22
#> 6 WA 98638 NASSEL NASELLE 22
#> 7 WA 98275 MULKITEO MUKILTEO 18
#> 8 WA 98065 SNOAUQLMIE SNOQUALMIE 16
#> 9 NY 11733 SETAUKET EAST SETAUKET 15
#> 10 OH 45202 CINCINATTI CINCINNATI 15
#> # ℹ 288 more rows
Then we can join the refined values back to the database.
wac <- wac %>%
left_join(good_refine, by = names(.)) %>%
mutate(city_refine = coalesce(city_refine, city_swap))
Our goal for normalization was to increase the proportion of city values known to be valid and reduce the total distinct values by correcting misspellings.
stage | prop_in | n_distinct | prop_na | n_out | n_diff |
---|---|---|---|---|---|
str_to_upper(wac$contributor_city) |
0.984 | 16403 | 0.036 | 92489 | 8054 |
wac$city_norm |
0.986 | 15218 | 0.036 | 77709 | 6847 |
wac$city_swap |
0.993 | 11323 | 0.036 | 42122 | 2950 |
wac$city_refine |
0.993 | 11071 | 0.036 | 41322 | 2699 |
You can see how the percentage of valid values increased with each stage.
More importantly, the number of distinct values decreased each stage. We were able to confidently change many distinct invalid values to their valid equivalent.
Before exporting, we can remove the intermediary normalization columns
and rename all added variables with the _clean
suffix.
wac <- wac %>%
select(
-city_norm,
-city_swap,
city_clean = city_refine
) %>%
rename_all(~str_replace(., "_norm", "_clean")) %>%
rename_all(~str_remove(., "_raw")) %>%
relocate(city_clean, state_clean, .before = zip_clean)
glimpse(sample_n(wac, 50))
#> Rows: 50
#> Columns: 42
#> $ id <int> 15713639, 12921313, 5030974, 13034145, 1605044, 16127223, 14415376, 9221402, 257381…
#> $ report_number <chr> "110072239", "100245795", "100714164", "100273743", "100837191", "110091708", "1100…
#> $ origin <chr> "C3", "C3", "C3", "C3", "C3", "C3", "C3", "C3", "C3", "C3", "C3", "C3", "C3", "C3",…
#> $ committee_id <chr> "15340", "10379", "4163", "13094", "3127", "2573", "26199", "10485", "68", "17554",…
#> $ filer_id <chr> "PIERDC 411", "KINGRC 109", "CONDC 807", "MIELT 209", "CITICE 372", "CANDP 507",…
#> $ type <chr> "Political Committee", "Political Committee", "Candidate", "Candidate", "Political …
#> $ filer_name <chr> "Pierce County Democratic Central Committee (PCDCC)", "KING COUNTY REPUBLICAN CENTR…
#> $ office <chr> NA, NA, "STATE REPRESENTATIVE", "COUNTY COMMISSIONER", NA, NA, NA, NA, NA, NA, NA, …
#> $ legislative_district <chr> NA, NA, "12", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "28", NA, NA, "32", NA, NA, "…
#> $ position <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "State Representative Pos. 1", …
#> $ party <chr> "DEMOCRATIC", "REPUBLICAN", "REPUBLICAN", "REPUBLICAN", NA, NA, NA, "REPUBLICAN", "…
#> $ ballot_number <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ for_or_against <chr> NA, NA, NA, NA, "For", NA, "For", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ jurisdiction <chr> NA, NA, "LEG DISTRICT 12 - HOUSE", "SPOKANE CO", "PUYALLUP SD 003", NA, "SPOKANE SD…
#> $ jurisdiction_county <chr> NA, NA, "CHELAN", "SPOKANE", NA, NA, "SPOKANE", NA, NA, NA, NA, NA, NA, "PIERCE", N…
#> $ jurisdiction_type <chr> NA, NA, "Legislative", "Local", NA, NA, "Local", NA, NA, NA, NA, "Statewide", NA, "…
#> $ election_year <int> 2022, 2008, 2016, 2008, 2019, 2022, 2021, 2012, 2018, 2019, 2008, 2008, 2011, 2022,…
#> $ amount <dbl> 27.00, 25.00, 100.00, 250.00, 25.00, 35.00, 3.00, 100.00, 20.00, 5.00, 5.70, 50.00,…
#> $ cash_or_in_kind <chr> "Cash", "Cash", "Cash", "Cash", "Cash", "Cash", "Cash", "Cash", "Cash", "Cash", "Ca…
#> $ receipt_date <date> 2022-01-06, 2008-02-13, 2016-07-27, 2008-09-09, 2018-05-31, 2022-05-06, 2021-02-01…
#> $ description <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ memo <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ primary_general <chr> "Full Election", "Full Election", "Primary", "General", "Full Election", "Full Elec…
#> $ code <chr> "Individual", "Individual", "Individual", "Individual", "Individual", "Individual",…
#> $ contributor_category <chr> "Individual", "Individual", "Individual", "Individual", "Individual", "Individual",…
#> $ contributor_name <chr> "SELLERS DRENA", "GLEDHILL MIYAKO", "SULLIVAN SEAN O", "IMPECOVEN MARJORIE", "NOSWO…
#> $ contributor_address <chr> "3556 S MADISON ST", "15509 61ST AVE NE", "11825 28TH SE", "P.O. BOX 141142", "4121…
#> $ contributor_city <chr> "TACOMA", "KENMORE", "LAKE STEVENS", "SPOKANE VALLEY", "TACOMA", "Yakima", "SPOKANE…
#> $ contributor_state <chr> "WA", "WA", "WA", "WA", "WA", "WA", "WA", "WA", "WA", "WA", "WA", "CA", "WA", "WA",…
#> $ contributor_zip <chr> "98409", "98028", "98258", "99214", "98407", "98908", "99208", "98310", "98446", "9…
#> $ contributor_occupation <chr> NA, NA, NA, "BUSINESS OWNER", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "LANDLORD", N…
#> $ contributor_employer_name <chr> NA, NA, NA, "BLASINGAME INSURANCE", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "SELF",…
#> $ contributor_employer_city <chr> NA, NA, NA, "SPOKANE VALLEY", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "KENT", NA, N…
#> $ contributor_employer_state <chr> NA, NA, NA, "WA", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "WA", NA, NA, NA, NA, NA,…
#> $ url <chr> "https://web.pdc.wa.gov/rptimg/default.aspx?repno=110072239", "https://web.pdc.wa.g…
#> $ contributor_location <chr> "POINT (-122.49006 47.227)", NA, NA, NA, "POINT (-122.49323 47.27793)", "POINT (-12…
#> $ na_flag <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
#> $ dupe_flag <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
#> $ receipt_year <dbl> 2022, 2008, 2016, 2008, 2018, 2022, 2021, 2012, 2018, 2019, 2008, 2007, 2011, 2022,…
#> $ city_clean <chr> "TACOMA", "KENMORE", "LAKE STEVENS", "SPOKANE VALLEY", "TACOMA", "YAKIMA", "SPOKANE…
#> $ state_clean <chr> "WA", "WA", "WA", "WA", "WA", "WA", "WA", "WA", "WA", "WA", "WA", "CA", "WA", "WA",…
#> $ zip_clean <chr> "98409", "98028", "98258", "99214", "98407", "98908", "99208", "98310", "98446", "9…
- There are 5,952,878 records in the database.
- There are 6,962 duplicate records in the database.
- The range and distribution of
amount
anddate
seem reasonable. - There are 2,018 records missing key variables.
- Consistency in geographic data has been improved with
campfin::normal_*()
. - The 4-digit
year
variable has been created withlubridate::year()
.
Now the file can be saved on disk for upload to the Accountability server.
clean_dir <- dir_create(here("wa", "contribs", "data", "clean"))
clean_path <- path(clean_dir, "wa_contribs_20040101-2023.csv")
write_csv(wac, clean_path, na = "")
(clean_size <- file_size(clean_path))
#> 2G
We can use the aws.s3::put_object()
to upload the text file to the IRW
server.
aws_path <- path("csv", basename(clean_path))
if (!object_exists(aws_path, "publicaccountability")) {
put_object(
file = clean_path,
object = aws_path,
bucket = "publicaccountability",
acl = "public-read",
show_progress = TRUE,
multipart = TRUE
)
}
aws_head <- head_object(aws_path, "publicaccountability")
(aws_size <- as_fs_bytes(attr(aws_head, "content-length")))
unname(aws_size == clean_size)