Yanqi Xu 2022-12-25 12:55:13
The Accountability Project is an effort to cut across data silos and give journalists, policy professionals, activists, and the public at large a simple way to search across huge volumes of public data about people and organizations.
Our goal is to standardizing public data on a few key fields by thinking of each dataset row as a transaction. For each transaction there should be (at least) 3 variables:
- All parties to a transaction.
- The date of the transaction.
- The amount of money involved.
This document describes the process used to complete the following objectives:
- How many records are in the database?
- Check for entirely duplicated records.
- Check ranges of continuous variables.
- Is there anything blank or missing?
- Check for consistency issues.
- Create a five-digit ZIP Code called
zip
. - Create a
year
field from the transaction date. - Make sure there is data on both parties to a transaction.
The following packages are needed to collect, manipulate, visualize,
analyze, and communicate these results. The pacman
package will
facilitate their installation and attachment.
The IRW’s campfin
package will also have to be installed from GitHub.
This package contains functions custom made to help facilitate the
processing of campaign finance data.
if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("irworkshop/campfin")
pacman::p_load(
tidyverse, # data manipulation
lubridate, # datetime strings
gluedown, # printing markdown
magrittr, # pipe operators
janitor, # clean data frames
aws.s3, # upload to aws s3
refinr, # cluster and merge
scales, # format strings
knitr, # knit documents
vroom, # read files fast
rvest, # html scraping
glue, # combine strings
here, # relative paths
httr, # http requests
fs # local storage
)
This document should be run as part of the R_campfin
project, which
lives as a sub-directory of the more general, language-agnostic
irworkshop/accountability_datacleaning
GitHub repository.
The R_campfin
project uses the RStudio
projects
feature and should be run as such. The project also uses the dynamic
here::here()
tool for file paths relative to your machine.
# where does this document knit?
here::here()
#> [1] "/Users/yanqixu/code/accountability_datacleaning"
The contribution data can be obtained from the Missouri Ethics Commission. We are interested in downloading contributions received by committees, so here we select the Itemized Contributions Received - Form CD1 A. This data is downloaded on Dec. 23, 2020.
raw_dir <- dir_create(here("state","mo", "contribs", "data", "raw"))
source <- "Missouri Ethics Commission"
We can download all the data files from 2011 to 2022.
moc <- map_df(dir_ls(raw_dir,regexp = ".+new.+"),read_delim, delim = ",")
moc <- moc %>% clean_names()
There are 503,247 rows of 18 columns.
glimpse(moc)
#> Rows: 503,247
#> Columns: 18
#> $ cd1_a_id <dbl> 1327191, 1327192, 1327193, 1327194, 1327195, 1327196, 1327197, 1327198,…
#> $ mecid <chr> "C000573", "C000573", "C000573", "C000573", "C000573", "C000573", "C000…
#> $ committee_name <chr> "Safer Families for Missouri", "Safer Families for Missouri", "Safer Fa…
#> $ committee <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ company <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ first_name <chr> "Robert", "J.", "Walter", "John", "John", "Bradley", "James", "John", "…
#> $ last_name <chr> "Beezley", "Bertram", "Bley", "Boyd", "Boyd", "Bradshaw", "Brandenburg"…
#> $ address_1 <chr> "1200 E. Woodhurst R-200", "2345 Grand Blvd. #1925", "1000 W. Nifong Bl…
#> $ address_2 <chr> NA, NA, "Woodrail Centre", NA, NA, NA, NA, NA, NA, NA, "1 Metropolitan …
#> $ city <chr> "Springfield", "Kansas City", "Columbia", "Independence", "Independence…
#> $ state <chr> "MO", "MO", "MO", "MO", "MO", "MO", "MO", "MO", "MO", "MO", "MO", "MO",…
#> $ zip <chr> "65804", "64108", "65203", "64051", "64051", "65804", "63105", "63459",…
#> $ employer <chr> "Robert T. Beezley P.C.", "Bertram & Graf LLC", "Bley & Evans LC", "Boy…
#> $ occupation <chr> "Attorney at Law", "Attorney at Law", "Attorney at Law", "Attorney at L…
#> $ date <chr> "1/24/2020 12:00:00 AM", "1/24/2020 12:00:00 AM", "1/24/2020 12:00:00 A…
#> $ amount <dbl> 25.00, 7.75, 25.00, 25.00, 25.00, 25.00, 12.50, 12.50, 25.00, 25.00, 25…
#> $ contribution_type <chr> "Monetary", "Monetary", "Monetary", "Monetary", "Monetary", "Monetary",…
#> $ report <chr> "April Quarterly Report", "April Quarterly Report", "April Quarterly Re…
tail(moc)
#> # A tibble: 6 × 18
#> cd1_a_id mecid commi…¹ commi…² company first…³ last_…⁴ addre…⁵ addre…⁶ city state zip emplo…⁷
#> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1919111 A1615… "Frien… <NA> <NA> "Carol" Graves 9619 M… <NA> Kans… MO 64134 Teacher
#> 2 1919112 A1615… "Frien… <NA> <NA> "Carol" Graves 9619 M… <NA> Kans… MO 64134 Teacher
#> 3 1919113 A1615… "Frien… <NA> <NA> "Wesle… Eppers… 3600 P… <NA> Inde… MO 64052 Not Em…
#> 4 1919114 C1310… "Dawso… <NA> <NA> "Danie… Imhoff 4954 S… <NA> Spri… MO 65804 Retired
#> 5 1919129 A2221… "Shian… <NA> <NA> "Shian… McMahon PO Box… <NA> Broo… MO 64628 Linn C…
#> 6 1919193 C1210… "Fitzp… Centra… <NA> <NA> <NA> P.O. 7… <NA> Jeff… MO 65102 <NA>
#> # … with 5 more variables: occupation <chr>, date <chr>, amount <dbl>, contribution_type <chr>,
#> # report <chr>, and abbreviated variable names ¹committee_name, ²committee, ³first_name,
#> # ⁴last_name, ⁵address_1, ⁶address_2, ⁷employer
Since the data is tablulated by year, we can filter out the rows that
were already included in the last update using the unique identifier
cd1_a_id
.
n_distinct(moc$cd1_a_id) == nrow(moc)
#> [1] TRUE
nrow(moc)
#> [1] 503247
moc_prev <- read_csv(here("state","mo","contribs", "data", "previous") %>% dir_ls())
moc <- moc %>% filter(cd1_a_id %out% moc_prev$cd1_a_id)
#after filtering out old data, the dataset's row number:
nrow(moc)
#> [1] 297490
Columns vary in their degree of missing values.
col_stats(moc, count_na)
#> # A tibble: 18 × 4
#> col class n p
#> <chr> <chr> <int> <dbl>
#> 1 cd1_a_id <dbl> 0 0
#> 2 mecid <chr> 0 0
#> 3 committee_name <chr> 0 0
#> 4 committee <chr> 277144 0.932
#> 5 company <chr> 276381 0.929
#> 6 first_name <chr> 45131 0.152
#> 7 last_name <chr> 41036 0.138
#> 8 address_1 <chr> 0 0
#> 9 address_2 <chr> 271121 0.911
#> 10 city <chr> 0 0
#> 11 state <chr> 0 0
#> 12 zip <chr> 160 0.000538
#> 13 employer <chr> 50647 0.170
#> 14 occupation <chr> 76220 0.256
#> 15 date <chr> 0 0
#> 16 amount <dbl> 0 0
#> 17 contribution_type <chr> 0 0
#> 18 report <chr> 0 0
We can flag any record missing a key variable needed to identify a transaction.
moc <- moc %>% flag_na(date, last_name, amount, committee_name,address_1)
sum(moc$na_flag)
#> [1] 41036
moc %>%
filter(na_flag) %>%
select(date, last_name, amount, committee,address_1)
#> # A tibble: 41,036 × 5
#> date last_name amount committee address_1
#> <chr> <chr> <dbl> <chr> <chr>
#> 1 10/25/2020 12:00:00 AM <NA> 500 <NA> "1600 INDIAN CREEK RD"
#> 2 10/25/2020 12:00:00 AM <NA> 500 MO OPTOMETRIC PAC "100 EAST HIGH STREET"
#> 3 11/28/2020 12:00:00 AM <NA> 500 LATHROP GAGE CONSULTING PAC "314 E HIGH ST"
#> 4 10/1/2020 12:00:00 AM <NA> 500 <NA> "8600 Hillcrest "
#> 5 10/8/2020 12:00:00 AM <NA> 200 <NA> "6 SW 2nd st Suite 107 "
#> 6 10/5/2020 12:00:00 AM <NA> 500 <NA> "4019 Clark Ave"
#> 7 10/8/2020 12:00:00 AM <NA> 2000 <NA> "1401 Hampton Ave 3rd Floor…
#> 8 10/6/2020 12:00:00 AM <NA> 500 <NA> "6320 Manchester Ave Ste 42…
#> 9 10/14/2020 12:00:00 AM <NA> 500 <NA> "4501 Emanuel Cleaver II Bl…
#> 10 10/7/2020 12:00:00 AM <NA> 2047 <NA> "6601 Winchester Ste 280"
#> # … with 41,026 more rows
We can also flag any record completely duplicated across every column.
moc <- flag_dupes(moc, -cd1_a_id)
sum(moc$dupe_flag)
#> [1] 2435
moc %>%
filter(dupe_flag) %>%
select(date, last_name, amount, committee,address_1)
#> # A tibble: 2,435 × 5
#> date last_name amount committee addre…¹
#> <chr> <chr> <dbl> <chr> <chr>
#> 1 11/17/2020 12:00:00 AM <NA> 500 "MPGA PROPANE PAC " "4110 …
#> 2 11/17/2020 12:00:00 AM <NA> 500 "MPGA PROPANE PAC " "4110 …
#> 3 9/18/2020 12:00:00 AM <NA> 3675 "Catalyst PAC" "PO Bo…
#> 4 9/18/2020 12:00:00 AM <NA> 3675 "Catalyst PAC" "PO Bo…
#> 5 6/26/2020 12:00:00 AM <NA> 5000 "NEXUS PAC" "714 L…
#> 6 6/26/2020 12:00:00 AM <NA> 5000 "NEXUS PAC" "714 L…
#> 7 10/22/2020 12:00:00 AM <NA> 1000 "AT&T MISSOURI EMPLOYEE POLITICAL ACTION COMMIT… "ONE A…
#> 8 10/22/2020 12:00:00 AM <NA> 1000 "AT&T MISSOURI EMPLOYEE POLITICAL ACTION COMMIT… "ONE A…
#> 9 10/23/2020 12:00:00 AM Kambitsch 3 <NA> "2065 …
#> 10 10/23/2020 12:00:00 AM Kambitsch 3 <NA> "2065 …
#> # … with 2,425 more rows, and abbreviated variable name ¹address_1
We can add the calendar year from date
with lubridate::year()
moc <- moc %>% mutate(date = as.Date(date, format = "%m/%d/%Y"),
year = year(date))
min(moc$date)
#> [1] "2002-07-28"
sum(moc$year < 2020)
#> [1] 70
max(moc$date)
#> [1] "2022-12-06"
sum(moc$date > today())
#> [1] 0
col_stats(moc, n_distinct)
#> # A tibble: 21 × 4
#> col class n p
#> <chr> <chr> <int> <dbl>
#> 1 cd1_a_id <dbl> 297490 1
#> 2 mecid <chr> 2908 0.00978
#> 3 committee_name <chr> 2945 0.00990
#> 4 committee <chr> 6334 0.0213
#> 5 company <chr> 11357 0.0382
#> 6 first_name <chr> 14600 0.0491
#> 7 last_name <chr> 31533 0.106
#> 8 address_1 <chr> 98594 0.331
#> 9 address_2 <chr> 6279 0.0211
#> 10 city <chr> 7360 0.0247
#> 11 state <chr> 76 0.000255
#> 12 zip <chr> 19663 0.0661
#> 13 employer <chr> 29094 0.0978
#> 14 occupation <chr> 15741 0.0529
#> 15 date <date> 1009 0.00339
#> 16 amount <dbl> 7916 0.0266
#> 17 contribution_type <chr> 2 0.00000672
#> 18 report <chr> 100 0.000336
#> 19 na_flag <lgl> 2 0.00000672
#> 20 dupe_flag <lgl> 2 0.00000672
#> 21 year <dbl> 6 0.0000202
summary(moc$amount)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> -1304.1 11.0 47.1 476.0 200.0 1996059.2
mean(moc$amount <= 0)
#> [1] 0.0002285791
These are the records with the minimum and maximum amounts.
glimpse(moc[c(which.max(moc$amount), which.min(moc$amount)), ])
#> Rows: 2
#> Columns: 21
#> $ cd1_a_id <dbl> 1740820, 1708931
#> $ mecid <chr> "C221856", "C180609"
#> $ committee_name <chr> "Mid-America Carpenters Regional Council Missouri-Kansas Area Political…
#> $ committee <chr> "Carpenters Help in the Political Process (CHIPP)", NA
#> $ company <chr> NA, NA
#> $ first_name <chr> NA, "Michael"
#> $ last_name <chr> NA, "Moehn"
#> $ address_1 <chr> "1401 Hampton Avenue", "31 Logan Rd"
#> $ address_2 <chr> NA, NA
#> $ city <chr> "St Louis", "Alton"
#> $ state <chr> "MO", "IL"
#> $ zip <chr> "63139", "620023206"
#> $ employer <chr> NA, "AMS Ameren Services"
#> $ occupation <chr> NA, "ZA0009 - EVP & Chief Financial Officer"
#> $ date <date> 2022-03-01, 2021-12-28
#> $ amount <dbl> 1996059.18, -1304.07
#> $ contribution_type <chr> "Monetary", "Monetary"
#> $ report <chr> "8 Day Before General Municipal Election-4/5/2022", "January Quarterly …
#> $ na_flag <lgl> TRUE, FALSE
#> $ dupe_flag <lgl> FALSE, FALSE
#> $ year <dbl> 2022, 2021
To improve the searchability of the database, we will perform some
consistent, confident string normalization. For geographic variables
like city names and ZIP codes, the corresponding campfin::normal_*()
functions are tailor made to facilitate this process.
For the street addresss
variable, the campfin::normal_address()
function will force consistence case, remove punctuation, and abbreviate
official USPS suffixes.
moc <- moc %>%
unite(
col = address_full,
starts_with("address"),
sep = " ",
remove = FALSE,
na.rm = TRUE
) %>%
mutate(
address_norm = normal_address(
address = address_full,
abbs = usps_street,
na_rep = TRUE
)
) %>%
select(-address_full)
moc %>%
select(contains("address")) %>%
distinct() %>%
sample_n(10)
#> # A tibble: 10 × 3
#> address_1 address_2 address_norm
#> <chr> <chr> <chr>
#> 1 5730 county loop 182 <NA> 5730 COUNTY LOOP 182
#> 2 341 Towne Vue Dr <NA> 341 TOWNE VUE DR
#> 3 452 gunnison gorge dr <NA> 452 GUNNISON GORGE DR
#> 4 1725 N.W. 9th St. <NA> 1725 NW 9TH ST
#> 5 9 Wilshire Terr <NA> 9 WILSHIRE TER
#> 6 25392 Royal Point Drive <NA> 25392 ROYAL POINT DR
#> 7 3605 Osage Beach Pkwy <NA> 3605 OSAGE BEACH PKWY
#> 8 7 DOUGLASS LANE <NA> 7 DOUGLASS LN
#> 9 966Victoria Ave <NA> 966VICTORIA AVE
#> 10 6676 American Setter Drive <NA> 6676 AMERICAN SETTER DR
For ZIP codes, the campfin::normal_zip()
function will attempt to
create valid five digit codes by removing the ZIP+4 suffix and
returning leading zeroes dropped by other programs like Microsoft Excel.
moc <- moc %>%
mutate(
zip_norm = normal_zip(
zip = zip,
na_rep = TRUE
)
)
progress_table(
moc$zip,
moc$zip_norm,
compare = valid_zip
)
#> # A tibble: 2 × 6
#> stage prop_in n_distinct prop_na n_out n_diff
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 moc$zip 0.731 19663 0.000538 80064 13580
#> 2 moc$zip_norm 0.995 8021 0.000669 1460 672
Valid two digit state abbreviations can be made using the
campfin::normal_state()
function.
moc <- moc %>%
mutate(
state_norm = normal_state(
state = state,
abbreviate = TRUE,
na_rep = TRUE,
valid = valid_state
)
)
moc %>%
filter(state != state_norm) %>%
count(state, state_norm, sort = TRUE)
#> # A tibble: 10 × 3
#> state state_norm n
#> <chr> <chr> <int>
#> 1 Mo MO 785
#> 2 mo MO 100
#> 3 Ne NE 11
#> 4 mO MO 3
#> 5 Fl FL 2
#> 6 Ks KS 2
#> 7 Ca CA 1
#> 8 Co CO 1
#> 9 Il IL 1
#> 10 Mi MI 1
progress_table(
moc$state,
moc$state_norm,
compare = valid_state
)
#> # A tibble: 2 × 6
#> stage prop_in n_distinct prop_na n_out n_diff
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 moc$state 0.997 76 0 920 20
#> 2 moc$state_norm 1 57 0.0000437 0 1
Cities are the most difficult geographic variable to normalize, simply due to the wide variety of valid cities and formats.
The campfin::normal_city()
function is a good start, again converting
case, removing punctuation, but expanding USPS abbreviations. We can
also remove invalid_city
values.
moc <- moc %>%
mutate(
city_norm = normal_city(
city = city,
abbs = usps_city,
states = c("MO", "DC", "MISSOURI"),
na = invalid_city,
na_rep = TRUE
)
)
We can further improve normalization by comparing our normalized value against the expected value for that record’s state abbreviation and ZIP code. If the normalized value is either an abbreviation for or very similar to the expected value, we can confidently swap those two.
moc <- moc %>%
rename(city_raw = city) %>%
left_join(
y = zipcodes,
by = c(
"state_norm" = "state",
"zip_norm" = "zip"
)
) %>%
rename(city_match = city) %>%
mutate(
match_abb = is_abbrev(city_norm, city_match),
match_dist = str_dist(city_norm, city_match),
city_swap = if_else(
condition = !is.na(match_dist) & (match_abb | match_dist == 1),
true = city_match,
false = city_norm
)
) %>%
select(
-city_match,
-match_dist,
-match_abb
)
The OpenRefine algorithms can be used to group similar strings and replace the less common versions with their most common counterpart. This can greatly reduce inconsistency, but with low confidence; we will only keep any refined strings that have a valid city/state/zip combination.
good_refine <- moc %>%
mutate(
city_refine = city_swap %>%
key_collision_merge() %>%
n_gram_merge(numgram = 1)
) %>%
filter(city_refine != city_swap) %>%
inner_join(
y = zipcodes,
by = c(
"city_refine" = "city",
"state_norm" = "state",
"zip_norm" = "zip"
)
)
#> # A tibble: 29 × 5
#> state_norm zip_norm city_swap city_refine n
#> <chr> <chr> <chr> <chr> <int>
#> 1 MO 63031 FLORRISANT FLORISSANT 4
#> 2 MO 63368 OFFALON O FALLON 3
#> 3 MO 65201 COLUMBIA ` COLUMBIA 3
#> 4 SC 29406 NORTH CHARLESTON CHARLESTON 3
#> 5 MO 63034 FLORRISANT FLORISSANT 2
#> 6 MO 63073 ST ALBINS SAINT ALBANS 2
#> 7 MO 65203 COLUMBIA ` COLUMBIA 2
#> 8 SD 57104 SOUIX FALLS SIOUX FALLS 2
#> 9 AZ 85015 PHOENIX PHOENIX PHOENIX 1
#> 10 CA 91106 PASENDENA PASADENA 1
#> # … with 19 more rows
Then we can join the refined values back to the database.
moc <- moc %>%
left_join(good_refine) %>%
mutate(city_refine = coalesce(city_refine, city_swap))
Our goal for normalization was to increase the proportion of city values known to be valid and reduce the total distinct values by correcting misspellings.
stage | prop_in | n_distinct | prop_na | n_out | n_diff |
---|---|---|---|---|---|
toupper(moc$city_raw) | 0.846 | 5912 | 0 | 45826 | 2019 |
moc$city_swap | 0.991 | 4395 | 0 | 2772 | 464 |
You can see how the percentage of valid values increased with each stage.
More importantly, the number of distinct values decreased each stage. We were able to confidently change many distinct invalid values to their valid equivalent.
Before exporting, we can remove the intermediary normalization columns
and rename all added variables with the _clean
suffix.
moc <- moc %>%
select(
-city_norm,
-city_swap,
city_clean = city_refine
) %>%
rename_all(~str_replace(., "_norm", "_clean")) %>%
rename_all(~str_remove(., "_raw"))
glimpse(sample_n(moc, 50))
#> Rows: 50
#> Columns: 25
#> $ cd1_a_id <dbl> 1769267, 1658941, 1561178, 1888638, 1791696, 1548187, 1650628, 1869689,…
#> $ mecid <chr> "C201509", "C180553", "C000374", "C191081", "C171069", "C180576", "C180…
#> $ committee_name <chr> "Piper for Missouri", "IST-MO Nominee PAC Number Two", "PUBLIC SAFETY C…
#> $ committee <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Profession…
#> $ company <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ first_name <chr> "Scott", "Jessica", "Ross", "Sally", "Tyrell", "BRANDON", "ROBERT", "Ch…
#> $ last_name <chr> "Duff", "LaBozzetta", "Grundyson", "Dehner", "Aagard", "STEINAGEL", "BI…
#> $ address_1 <chr> "23 Meandering Way", "813 Westbrooke Meadows Court", "2800 Cherry St", …
#> $ address_2 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ city <chr> "Round Rock", "St. Louis", "Kansas City", "St. Louis", "Salt Lake City"…
#> $ state <chr> "TX", "MO", "MO", "MO", "UT", "MT", "MI", "MO", "MO", "MO", "CA", "MO",…
#> $ zip <chr> "78664", "63021", "64108", "63129", "84111", "59602-7381", "48374-2568"…
#> $ employer <chr> "Self", "Venture Cafe St. Louis", "KC MO Fire Department", "Not employe…
#> $ occupation <chr> "Atty", "Director", "Chief Officer", "Not employed", "Grants Specialist…
#> $ date <date> 2022-02-09, 2021-09-08, 2020-10-02, 2022-09-25, 2022-04-15, 2020-10-30…
#> $ amount <dbl> 25.00, 10.00, 10.00, 25.00, 20.00, 16.67, 416.67, 5.00, 50.00, 100.00, …
#> $ contribution_type <chr> "Monetary", "Monetary", "Monetary", "Monetary", "Monetary", "Monetary",…
#> $ report <chr> "April Quarterly Report", "October Quarterly Report", "AMENDED 8 Day Be…
#> $ na_flag <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
#> $ dupe_flag <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
#> $ year <dbl> 2022, 2021, 2020, 2022, 2022, 2020, 2021, 2022, 2022, 2022, 2022, 2022,…
#> $ address_clean <chr> "23 MEANDERING WAY", "813 WESTBROOKE MEADOWS CT", "2800 CHERRY ST", "54…
#> $ zip_clean <chr> "78664", "63021", "64108", "63129", "84111", "59602", "48374", "64118",…
#> $ state_clean <chr> "TX", "MO", "MO", "MO", "UT", "MT", "MI", "MO", "MO", "MO", "CA", "MO",…
#> $ city_clean <chr> "ROUND ROCK", "ST LOUIS", "KANSAS CITY", "SAINT LOUIS", "SALT LAKE CITY…
- There are 297,490 records in the database.
- There are 2,435 duplicate records in the database.
- The range and distribution of
amount
anddate
seem reasonable. - There are 41,036 records missing key variables.
- Consistency in geographic data has been improved with
campfin::normal_*()
. - The 4-digit
year
variable has been created withlubridate::year()
.
Now the file can be saved on disk for upload to the Accountability server.
clean_dir <- dir_create(here("state","mo", "contribs", "data", "clean"))
clean_path <- path(clean_dir, "mo_contribs_clean_2020-2022.csv")
write_csv(moc, clean_path, na = "")
(clean_size <- file_size(clean_path))
#> 68.7M
file_encoding(clean_path) %>%
mutate(across(path, path.abbrev))
#> # A tibble: 1 × 3
#> path mime charset
#> <fs::path> <chr> <chr>
#> 1 …tability_datacleaning/state/mo/contribs/data/clean/mo_contribs_clean_2020-2022.csv <NA> <NA>
We can use the aws.s3::put_object()
to upload the text file to the IRW
server.
s3_path <- path("csv", basename(clean_path))
if (!object_exists(s3_path, "publicaccountability")) {
put_object(
file = clean_path,
object = s3_path,
bucket = "publicaccountability",
acl = "public-read",
show_progress = TRUE,
multipart = TRUE
)
}
s3_head <- head_object(s3_path, "publicaccountability")
(s3_size <- as_fs_bytes(attr(s3_head, "content-length")))
unname(s3_size == clean_size)