HOWTO
A cheat sheet for common data journalism stuff. For details on installing these tools, see how I work. Use CMD + F to search the page, or the jump menu below if you know what you're looking for.
R packages: shortcut keys | base | readr | scales | ggpmisc | dplyr | stringr | plyr | clipr | googlesheets4 | janitor
For more tips and tricks for working with common data formatting issues in R, check out the code snippet collection here.
Troubleshooting Location errors
Command line tools
A collection of tips and tricks for working with tools executed using bash terminals.
Search text files for specific character sequences.
grep "DUKES" absentee.csv
Return lines containing a string from a specified file and print to the command line.
Print to a file
grep "\"DUKES\",\"MICHAEL\",\"TYLER\"" absentee.csv > dukes.csv
Search for a string with quotes and output all lines to a file.
Recursively search a directory of files, first line only
head -1 ./*/*|grep -B1 'Hospital overall rating' > variable.txt
Piping head into grep lets you chain commands, and the -B1 flag prints the line before each match, which here is the file name that head adds above each file's first line.
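To see why -B1 surfaces the file name, it helps to try the pipeline on throwaway files. The sketch below is only an illustration: the scratch directory, the two CSV files and the find_header helper are all made up for the example.

```shell
# Build a scratch directory with two hypothetical CSVs in subfolders.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/a" "$demo_dir/b"
printf 'Provider ID,Hospital overall rating\n1,5\n' > "$demo_dir/a/general.csv"
printf 'Provider ID,Payment category\n1,low\n' > "$demo_dir/b/payment.csv"

# head -1 on several files prints a "==> path <==" banner before each
# first line; grep -B1 keeps that banner for every header that matches.
find_header() {
  (cd "$1" && head -1 ./*/* | grep -B1 "$2")
}

find_header "$demo_dir" 'Hospital overall rating'
```

Only the banner and header for a/general.csv should come back, because the other file's first line doesn't contain the search string.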
Get a preview of a file.
See the top
Print the first 10 lines of a file to the command line.
Specify the number of lines
head -1 absentee.csv > absentee_header.csv
Get the first line of a file and save it to a CSV file.
A power tool for processing and converting video and audio files.
Make a GIF
ffmpeg -ss 5 -t 1.7 -i video.MOV -vf "fps=24,scale=640:-1:flags=lanczos,split[s0][s1];[s0]palettegen[p];[s1][p]paletteuse" -loop 0 video.gif
Create a high-quality GIF, 640 pixels wide at 24 frames per second, from the specified video file, starting 5 seconds in and lasting 1.7 seconds.
Caption the GIF
ffmpeg -ss 278.8 -t 3.3 -i wings.mp4 -filter_complex "fps=24,scale=640:-1:flags=lanczos,drawtext=box=1:email@example.com:boxborderw=5:fontfile=/System/Library/Fonts/Supplemental/Impact.ttf:text='CONTRARY TO MY APPEARANCE,':fontsize=48:fontcolor=white:x=(w-tw)/2:y=(h/PHI)+th,drawtext=box=1:firstname.lastname@example.org:boxborderw=5:fontfile=/System/Library/Fonts/Supplemental/Impact.ttf:text='I AM ENJOYING THIS.':fontsize=48:fontcolor=white:x=(w-tw)/2:y=(h/PHI)+th+50,split[x1][x2];[x1]palettegen[p];[x2][p]paletteuse" wings.gif
Tweak the parameters (or delete the second line) to adjust the font, text etc. of the caption.
MOV to mp4
ffmpeg -i apple.mov -vcodec h264 -acodec aac apple.mp4
Use a codec flag to convert a video file from QuickTime to the more universal mp4 format.
AVI to mp4
ffmpeg -i full_video.avi full_video.mp4
Convert an AVI file to the more universal mp4 format.
Combine video clips
ffmpeg -f concat -safe 0 -i vidlist.txt -c copy full_video.avi
Combine all of the files recorded in a text file called vidlist.txt, which looks like this:
file '/Users/username/directory/vid_seq001.avi'
file '/Users/username/directory/vid_seq002.avi'
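Typing that list by hand gets tedious for long sequences. Here is a minimal sketch that builds it from every .avi file in a folder. The make_vidlist helper is hypothetical (it is not part of ffmpeg), and it assumes zero-padded file names so that alphabetical order matches playback order, and paths that contain no single quotes.

```shell
# make_vidlist CLIP_DIR OUT_FILE
# Writes one "file '/path/to/clip.avi'" line per clip, in sorted order.
make_vidlist() {
  clip_dir=$1
  out_file=$2
  : > "$out_file"                  # start with an empty list
  for clip in "$clip_dir"/*.avi; do
    [ -e "$clip" ] || continue     # skip the literal pattern if nothing matches
    printf "file '%s'\n" "$clip" >> "$out_file"
  done
}
```

Something like make_vidlist /Users/username/directory vidlist.txt would then produce a file in the format shown above.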
Create a video from a sequence of images
ffmpeg -r 1/5 -i img%03d.jpg -c:v libx264 -vf "fps=25,format=yuv420p" out.mp4
Read in a sequence of images from a folder and write to an mp4 file. The -r flag is the framerate, where the duration of each image is the inverse of the provided value (e.g. 1/5 is 5 seconds, 60 is 1/60 of a second). The -i flag specifies the filename structure, with 0 padding specified (e.g. img%03d.jpg will iterate through img001.jpg, img002.jpg, img003.jpg etc.). More details here.
A power tool for processing and converting PDF files.
pdftk *.pdf cat output all_documents.pdf
pdftk doc01.pdf doc02.pdf cat output all_documents.pdf
Combine all the PDF files in the present directory into a single file, or specify individual files. Note: Check to make sure the capitalization of the file extension matches.
Split files by page number
pdftk blue_docs.pdf cat 1-700 output blue_docs01.pdf
pdftk blue_docs.pdf cat 701-end output blue_docs02.pdf
Specify the page number or use the end keyword to slice up a document.
Split a PDF portfolio
pdftk doj_emails_portfolio.pdf unpack_files output doj_emails
PDF portfolios contain a bunch of individual files bound up in a filetype that needs a native PDF reader. Get around this by unpacking each file into a specific directory.
Split a PDF portfolio with attachments
pdftk doj_emails_portfolio.pdf unpack_files output doj_emails
IFS=$'\n'; set -f
for f in $(find ./doj_emails/ -name '*.pdf'); do
  pdftk "$f" unpack_files output ./doj_emails/
done
unset IFS; set +f
If your PDF portfolio has attachments within the individual PDFs, you can use your terminal to unpack the portfolio into a directory, then set up a loop to unpack all of the PDFs in that directory one by one. (Thanks to this Stack Overflow thread for tips on bash recursion and dealing with spaces in filenames.)
A targeted utility to pull down geographic data from ESRI maps.
Download a geojson
esri2geojson https://services.arcgis.com/iFBq2AW9XO0jYYF7/arcgis/rest/services/Covid19byZIPnew/FeatureServer/0 nc_zipDATE.geojson
Download data from the ESRI REST endpoint that powers the N.C. DHHS COVID map of cases by zip code and save it as a geojson file.
A power tool for recursively downloading files, for example from the Web.
Download a single file
Saves the file from the specified URL.
Download a list of files
wget -i file_list.txt
Saves individual files from URLs specified in a TXT file, one URL on each line.
Download a directory
wget --recursive --no-parent http://mtdukes.com
Recursively download an entire site's contents.
A tool for identifying what kind of file you have, including its character encoding.
Detect encoding of a file
file -I input.csv
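The -I flag prints the file's MIME type and character set (this is the macOS form; on Linux the equivalent flag is lowercase -i).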
Tool to make substitutions in a text file (submitted by Chris Alcantara).
Replace all instances of a word and output result to new file
sed "s/dook/duke/g" ./old.csv > new.csv
Use / as a delimiter to separate the pattern you want to find from its replacement.
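If the text you're replacing contains slashes itself (dates, file paths), sed accepts other delimiter characters in place of /. A small sketch using | as the delimiter, with a made-up fix_dates helper and sample data:

```shell
# Swap a slash-formatted date for an ISO date. Using | as the delimiter
# means the slashes in the pattern don't need to be escaped.
fix_dates() {
  sed "s|01/01/2001|2001-01-01|g"
}

printf 'name,date\nduke,01/01/2001\n' | fix_dates
```

The same substitution with / as the delimiter would need every slash in the date escaped as \/.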
Replace all instances of a word within the file
sed -i "" "s/dook/duke/g" ./data.csv
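Substitute directly in the file by passing an empty string after the -i flag (this is the macOS/BSD form; GNU sed takes -i with no separate argument). Pass a suffix instead to keep a backup of the original alongside it.
sed -i ".bak" "s/dook/duke/g" ./data.csv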
A basic character counting utility for the command line.
Count the number of lines in a file
wc -l < data_file.txt
Redirecting the file into wc with < excludes the file name from the results.
Pipe the results of some data and count the lines
curl mtdukes.com --silent | wc -l
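The difference between passing wc a file name and redirecting the file with < is easiest to see side by side. This is a small sketch on a scratch file; the count_lines helper is made up for the example.

```shell
sample=$(mktemp)
printf 'a\nb\nc\n' > "$sample"

# With a file argument, wc prints the count followed by the file name.
wc -l "$sample"

# With input redirection, wc only sees a stream, so it prints just the
# count. tr strips the padding some wc implementations add.
count_lines() {
  wc -l < "$1" | tr -d ' '
}
count_lines "$sample"
```

The bare number is handy when you want to drop the count straight into another command or a variable.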
A power tool for quickly editing images.
Batch crop a folder of images
mogrify -crop 800x450+0+40 -path ./cropped *.jpg
In a folder of images, crop every jpg image at size 800x450, with a 0px offset from the left (x) and a 40px offset from the top (y).
Convert PST to MBOX
Outputs a single file in mbox format, which is a more open format you can import into a number of email clients. That includes Mozilla's Thunderbird, which can import the file using the ImportExportNG add-on.
Convert PST to individual email files
readpst -e -D public_records.pst
Separates the PST into individual eml files, which services like Google's Pinpoint can read. Attachments are embedded in each file. The -D flag preserves deleted items.
Convert PST to rich text files and export attachments
readpst -S -D public_records.pst
Separates the PST into individual files for each email's rich text body and each attachment. The numbered files are in eml format, with no extension. The -D flag preserves deleted items.
Convert PST to eml and msg files
readpst -m -D public_records.pst
Produces both msg and eml files for each message. The -D flag preserves deleted items.
Quickly and efficiently unzip files (or a folder full of files). Can also unzip some files where a normal unzipping GUI throws errors. Useful for .gz file extensions.
Unzip a single file
Unzips a specific file and replaces it with the unzipped version.
Keep the original zipped file
gunzip -c map_file.gz > map_file.shp
Use the -c (stdout) flag to write the unzipped contents to the console, then redirect that output to a new file to keep the original.
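A quick way to convince yourself the original survives is to try it on a scratch file. The file names below are made up for the example.

```shell
work=$(mktemp -d)
printf 'hello\n' > "$work/map_file.txt"
gzip "$work/map_file.txt"     # replaces the file with map_file.txt.gz

# -c sends the decompressed bytes to stdout; redirecting them to a new
# file leaves the .gz archive untouched.
gunzip -c "$work/map_file.txt.gz" > "$work/map_file_copy.txt"

ls "$work"
```

After this runs, the directory should hold both the untouched .gz archive and the unzipped copy.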
Unzip a folder full of files
gunzip -r /map_files
Uses the recursive flag to iterate through every zipped file in a folder and replace it with the unzipped version.
A suite of tools to analyze your system (for Mac).
Examine CPU threads
sysctl hw.physicalcpu hw.logicalcpu
Prints the number of physical and logical cores your CPU has.
A few common shortcuts save you from typing in RStudio.
Start a new section
Command + Shift + R
Prompt for a new label used in the document outline for an R script
Execute a command
Command + Enter
Run a section of code in your R script.
Use a pipe
Command + Shift + M
Insert %>% at your cursor to pipe output to the next line.
Use an assignment
Option + -
Insert <- at your cursor to assign output to a variable.
The stripped down version of R has lots of built-in stuff worth using.
Clear all environment variables
rm(list = ls())
Start with a clean slate using the rm function.
Set your working directory
Save time otherwise spent typing out long path names.
Get unique values
property_sales %>% unique()
Generates a dataframe of unique rows across all fields.
Get duplicate values
wake_sales %>% .[duplicated(.), ]
Generates a dataframe of duplicated rows, comparing all fields.
Get a list of files in a folder
county_data <- list.files(path = './data/counties', full.names = TRUE)
The full.names argument prepends the directory path if TRUE, and returns the file name only if FALSE.
Remove a package
Removing a package can help when you have conflicts between functions with the same name.
Format a date
as.Date('01/01/2001', format = '%m/%d/%Y' )
Specify the format explicitly using the syntax from strptime.
Find and replace characters in a string
gsub(',', '', 'womp,womp')
gsub('\\(', ',for real', 'Replace the literal parenthesis (' )
Enter a pattern, replacement and data value to search.
Get rid of non-ASCII characters
gsub('[^ -~]', '', '日本人GALATIA')
This pattern translates to "not any printable ASCII character." Useful when cleaning a malformed file. You can also use the [^ -~] pattern in other contexts.
Load a file with an annoying encoding
vax_data <- read.delim('cnty20210731.csv', fileEncoding = 'UTF-16LE', sep = '\t', col.names = c('index', 'county', 'week_of', 'age12_17', 'age18_24', 'age25_49', 'age50_64') )
Turn off scientific notation
options(scipen = 999)
Prints out the full numeral in your current workspace.
Test for significance when preparing proportions
#Black maternal mortality in 2020 and 2021 compared to the number of live births
prop.test(x = c(323, 506), n = c(397827, 517889), correct = FALSE)
Use a 2-sample test for equality of proportions to generate a p-value, which tells you how likely you would be to see a difference at least this large if the two underlying proportions were actually equal. Typically, p-values of less than 0.05 are considered statistically significant (a 95% confidence level). The prop.test function accepts a vector of values for the numerator (x) and the denominator (n).
A "fast and friendly way" to read in data. Part of the tidyverse suite of packages.
Specify a default column type
my_data <- read_csv('../my_data.csv', col_types = cols(.default = 'c', date = 'D'))
Tell read_csv to import all columns as character by default, except for the date field, which should be a date (you can remove the date part if you want to read everything in as character).
Get the file encoding
guess_encoding(file = 'annoying_file.csv')
Useful for errors reading in the file (like embedded nulls).
Pull data from a GitHub gist
county_fips <- read_tsv( url( 'https://gist.githubusercontent.com/mtdukes/e0c6563927fb4f3e48f4e092b84b7023/raw/56e5abc1daf2277b5d901cbc25b9f9e64ab8c073/nc_fips_tab.tsv' ))
Use the url function from base R to pull formatted data posted in public gists, like these convenience files. Swap out the delimiter function (read_csv, read_delim, etc.) as needed. Get the URL by clicking "Raw" on the Gist page.
Specify a file encoding when loading data
my_dataframe <- read_tsv('annoying_file.csv', locale = locale(encoding = "UTF-16LE") )
Specify non-UTF-8 encodings you get from the guess_encoding function.
Write and timestamp a csv
my_spreadsheet %>% write_csv(paste0('my_data/my_spreadsheet', format(Sys.time(),'%Y%m%d%H%M'), '.csv'))
Using Sys.time(), you can quickly save a dataset with an automatic timestamp for easy organization.
A library to make scaling and labeling easier.
Show figures as dollars
vax_income %>% ggplot(aes(x = median_income, y=PctTotal)) + geom_point() + scale_x_continuous(labels = scales::dollar_format())
The dollar_format function shortcuts the annoying parsing issues.
Miscellaneous extensions to the ggplot package.
Include a regression equation on your scatterplot
vax_income %>% ggplot(aes(x = median_income, y = pct_total)) + geom_point() + geom_smooth(method = "lm", formula = y ~ x, show.legend = FALSE) + stat_poly_eq(aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~")), label.x.npc = "right", label.y.npc = 0.15, formula = y ~ x, parse = TRUE, size = 3)
The stat_poly_eq function lets you annotate the graph with a regression formula. BONUS: What's a good value for R-squared?
Get a random sample of rows
nc_voters %>% sample_n(10)
Specify the number of rows from the dataframe to return.
Convert table to uppercase
clean_table <- dirty_table %>% mutate(across(where(is.character), toupper))
Transforms all columns containing characters to uppercase all at once. Incredibly useful for cleaning data!
Fix multibyte strings and bad character encodings
df_clean <- df %>% mutate(across(everything(), ~ iconv(.x, sub = '') ))
Works across an entire dataframe, removing all malformed characters, multibyte strings or bad, non-UTF8 encodings that can't be converted.
Remove prefix characters from column names
clean_table <- dirty_table %>% rename_all(~str_replace(., 'column_prefix', ''))
Use rename_all to replace all matching patterns. Uses regex.
Tools to solve common problems, like performing the same task over and over. NOTE: This package conflicts with some dplyr commands, so if you're getting weird errors, this might be why.
Repeat a function using a list as input
precinct_sort <- ldply(county_files, read_tsv, na='', col_types = cols( county_id = col_double(), election_dt = col_date(format = '%m/%d/%Y'), contest_id = col_double(), contest_title = col_character(), contest_vote_for = col_double(), precinct_code = col_character(), candidate_id = col_double(), candidate_name = col_character(), candidate_party_lbl = col_character(), voting_method_lbl = col_character(), voting_method_rslt_desc = col_character(), vote_ct = col_double() ))
The first parameter is the list and the second is the function you want to repeat. Everything that follows are parameters specific to your function. You can use your own functions too.
A set of simple commands for writing to and reading from the clipboard.
Write to the clipboard
age_group_populations %>% mutate(lookup_helper = paste0(fips,age_group), .after = 'age_group') %>% write_clip()
Copies dataframes in a tab-delimited format for easy pasting into spreadsheets.
Read and write data from Google Sheets. Part of the tidyverse.
Read in data from a Google Sheet
test <- read_sheet('https://docs.google.com/spreadsheets/d/<<SPREADSHEET_ID>>/edit#gid=<<SHEET_ID>>')
Authentication may be required depending on permissions. Accepts the URL of the sheet and writes to a dataframe.
Overwrite data in an existing sheet
write_sheet(census_data, as_sheets_id('https://docs.google.com/spreadsheets/d/<<SPREADSHEET_ID>>/edit#gid=<<SHEET_ID>>'), sheet = "census_data" )
Accepts a dataframe and writes to a sheet specified with the as_sheets_id function and a URL. If a sheet name isn't specified, it will create a new sheet with the dataframe name.
Excellent library with convenience functions for cleaning and exploring data.
Get the totals of a dataframe
zoo_animals %>% count(animal_type) %>% adorn_totals()
Use adorn_totals to tally up a number column (or multiple columns) in a dataframe.
Calculate percentages across categories
zoo_animals %>% count(animal_type) %>% adorn_totals() %>% adorn_percentages('col') %>% adorn_pct_formatting() %>% adorn_ns(position = "front")
A few extra lines of code place both the raw number and percentage together.
Math for journalism
Formulas and concepts I always have to look up. For more, read Numbers in the Newsroom by Sarah Cohen.
Also check out Ben Welsh's Observable collection of calculators based on Cohen's book.
A measure of how tightly clustered, or how spread out, data is around its average. Often described using the Greek letter sigma (σ).
For normally distributed data (often described as a "bell curve"), about two thirds of the observed values fall within one standard deviation of the average.
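For a quick check from the command line, the calculation can be sketched with awk. The mean_sd helper and the eight sample values are made up for illustration. Note that this version computes the population standard deviation (dividing by n), while R's sd() divides by n - 1.

```shell
# mean_sd reads one number per line and prints the mean and the
# population standard deviation (dividing by n, not n - 1).
mean_sd() {
  awk '{ sum += $1; sumsq += $1 * $1; n++ }
       END {
         mean = sum / n
         printf "%.2f %.2f\n", mean, sqrt(sumsq / n - mean * mean)
       }'
}

printf '%s\n' 2 4 4 4 5 5 7 9 | mean_sd
```

For these values the mean is 5 and the standard deviation is 2, so "within one standard deviation" means between 3 and 7.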
Calculating the rate of something happening in a subgroup and comparing it to another can help suss out disproportionate impact, especially when the groups are different sizes.
Comparing two groups
A school has 300 white and 120 Black students. Last year, 30 white students were suspended, and 20 Black students were suspended.
It may be enough to say that Black students made up 40% of suspensions while only making up 29% of the school. But you may want to put a finer point on the disparity. First, calculate the suspension rate for Black students.
20 / 120 = 0.167
Next, do the same for white students.
30 / 300 = 0.1
Then you compare the two rates of suspension.
0.167 / 0.1 = 1.7
So Black students are suspended at about 1.7 times the rate of white students.
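The same arithmetic can be sketched as a small reusable command. The rate_ratio helper below is made up for illustration; it takes the event count and group size for each of the two groups.

```shell
# rate_ratio EVENTS_A TOTAL_A EVENTS_B TOTAL_B
# Prints group A's rate, group B's rate, and A's rate divided by B's.
rate_ratio() {
  awk -v ea="$1" -v ta="$2" -v eb="$3" -v tb="$4" 'BEGIN {
    rate_a = ea / ta
    rate_b = eb / tb
    printf "%.3f %.3f %.1f\n", rate_a, rate_b, rate_a / rate_b
  }'
}

# Suspensions in the example: 20 of 120 Black students, 30 of 300 white students.
rate_ratio 20 120 30 300
```

With the school's numbers this prints 0.167 0.100 1.7, matching the steps worked out by hand.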
Comparing two groups
Instead of calculating and comparing the rate of something happening in a subgroup, calculating an odds ratio means you have to look at how much more likely something is to happen than not happen within that subgroup. So:
A class has 21 boys and 16 girls. On a recent test, 11 boys and 14 girls passed.
First, calculate the odds that boys passed the test vs. not passing the test.
11 / 10 = 1.1
Do the same for the girls.
14 / 2 = 7
Now you can compare the ratios.
7 / 1.1 = 6.4
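So the odds of passing the test were about 6.4 times as high for girls as for boys. (That is a ratio of odds, not of pass rates: 88% of girls passed, compared with 52% of boys.)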
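As a rough sketch, the odds ratio can be scripted the same way. The odds_ratio helper is made up for illustration; it takes the "happened" and "didn't happen" counts for each group.

```shell
# odds_ratio YES_A NO_A YES_B NO_B
# Prints group A's odds, group B's odds, and A's odds divided by B's.
odds_ratio() {
  awk -v ya="$1" -v na="$2" -v yb="$3" -v nb="$4" 'BEGIN {
    odds_a = ya / na
    odds_b = yb / nb
    printf "%.1f %.1f %.1f\n", odds_a, odds_b, odds_a / odds_b
  }'
}

# Test results in the example: 14 girls passed and 2 did not;
# 11 boys passed and 10 did not.
odds_ratio 14 2 11 10
```

With the class's numbers this prints 7.0 1.1 6.4.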
Making sense of symbols
Tips and tricks for understanding mathematical symbols outside the scope of the normal add, subtract, etc.
As Freya Holmér points out, these two "scary math symbols" are just for loops (Image courtesy of Freya Holmér).
In contexts like public health, counts of rare events such as cancer or maternal mortality are assumed to follow a Poisson distribution, which we can use to calculate the confidence intervals used to compare rates. For more, see this handout from the Utah Department of Health and Human Services:
This distribution is not symmetric about its mean and so the associated confidence intervals will not be symmetric (the upper limit is farther from the estimate than is the lower limit).
#Pregnancy-related deaths (observed events)
1345
#95% confidence interval
0.95
#Calculated confidence interval
[1274.07160, 1418.84932]
#Live births
4636541
#Pregnancy-related death rate per 100,000 live births
1345 / 4636541 * 100000 = 29.0
#Lower bound
1274.07160 / 4636541 * 100000 = 27.5
#Upper bound
1418.84932 / 4636541 * 100000 = 30.6
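The conversion from counts to a rate per 100,000 is the same for the estimate and both bounds, so it can be sketched as one small helper. The per_100k name is made up for illustration; the confidence limits themselves still have to come from a calculator or a statistics package.

```shell
# per_100k EVENTS POPULATION
# Converts a count of events into a rate per 100,000 population.
per_100k() {
  awk -v events="$1" -v pop="$2" 'BEGIN {
    printf "%.1f\n", events / pop * 100000
  }'
}

per_100k 1345 4636541          # observed rate
per_100k 1274.07160 4636541    # lower bound
per_100k 1418.84932 4636541    # upper bound
```

These three calls reproduce the 29.0, 27.5 and 30.6 figures above.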
If you don't want to use a calculator, you can also calculate the upper and lower bounds using the chi square critical value in R using the formulas here.
#for 95% (0.95) confidence interval
ci <- 0.95
#where x is the number of observed events
x <- 1345
lower_bound <- qchisq((1 - ci) / 2, 2 * x) / 2
upper_bound <- qchisq(1 - (1 - ci) / 2, 2 * (x + 1)) / 2
More significance testing
One of the most common confidence levels is 95%. Its two-sided critical value (z of about 1.96) is the 97.5th percentile point of the standard normal distribution.
| Confidence level | Alpha (α) | α/2 | Z(α/2) | Z(1 - α/2) | Z score (Z(α/2)) |
The U.S. Census Bureau has good resources for testing for significance with their data, which uses a 90% confidence level.
General guidance for working with mapping files and geographic information systems.
Latitude and Longitude
(Image credit Illinois State University)
Latitude is the Y axis, with the equator at 0. Longitude is the X axis, with the prime meridian at 0.
North America sits in the northwest quadrant, so latitude values will be positive (N) and longitude values will be negative (W).
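A quick sanity check for a coordinate pair is to print which hemispheres its signs imply. The hemispheres helper below is made up for illustration, and the sample point is an approximate location for Raleigh, N.C. If a North Carolina point comes back as anything other than N W, the columns are probably swapped or a minus sign was lost.

```shell
# hemispheres LATITUDE LONGITUDE
# Prints N or S for the latitude sign and E or W for the longitude sign.
hemispheres() {
  awk -v lat="$1" -v lon="$2" 'BEGIN {
    printf "%s %s\n", (lat >= 0 ? "N" : "S"), (lon >= 0 ? "E" : "W")
  }'
}

hemispheres 35.78 -78.64   # approximate coordinates for Raleigh, N.C.
```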
NC bounding box
For subsetting coordinates or geometries that require a "bounding box," use these coordinates for North Carolina (h/t to Anthony Louis D'Agostino).
Plugins, URL parameters and other neat stuff.
Tips and tricks for handling PDFs in a Web browser (like Chrome)
Jump to a page
Add the page number as a URL parameter to link directly to a page.
Shortcuts and other cool things that help navigate various online video players.
Jump to a time in YouTube
Use the t parameter to specify the jump point in seconds.
A few common things to check when things get mucked up.
Mapping or working with counties, cities, etc. in North Carolina.
You three again.
If you're missing one of North Carolina's 100 counties, or your map is inexplicably blank, check New Hanover County, McDowell County and Winston-Salem first. Then think of other location names that might not be a literal, string-to-string match.
Collections of commonly used lists and references in various data structures.
- State names and postal code abbreviations, comma- and line-separated and text-qualified by single quotes
- US counties and FIPS codes by state, tab-delimited
- NC counties, comma- and line-separated and text-qualified by single quotes
- NC counties and FIPS codes, tab-delimited
- NC counties and election/voter data codes, tab-delimited
- NC counties and court codes, as tuples
- NC counties and court codes, tab-delimited
- NC municipalities and counties, tab-delimited
- NC places and census designation, tab-delimited
- Not where you expect - A list of NC cities not in the counties you think