/
pdftools-guide.rmd
506 lines (372 loc) · 16.6 KB
/
pdftools-guide.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
---
title: "Beer and pdftools - a vignette"
author: "Thomas Mock"
date: "3/30/2020"
output:
html_document:
toc: true
toc_depth: 4
toc_float: true
theme: cosmo
dpi: 500
fig_width: 10
fig_height: 8
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
![](beer-kegs.jpeg)
# Scraping Complex Tables from PDFs with PDF Tools
The goal of this is to provide a guide to extracting irregularly formatted tables from PDFs.
### Load Libraries
We'll use ROpenSci's `pdftools` package along with several `tidyverse` packages:
- `stringr` - text manipulation
- `dplyr` - general data manipulation
- `tidyr` - data cleaning
- `purrr` - repeated application of a function
```{r}
library(tidyverse)
library(pdftools)
```
## PDFs
The PDFs for this guide come from [Alcohol and Tobacco Tax and Trade Bureau](https://www.ttb.gov/beer/statistics). We'll use the 2011-2014 data for this example (84 total PDFs). For the purpose of today the files have already been downloaded, but I used the following script.
```{r, eval = FALSE}
# General function for download
download_monthly_stats_pdf <- function(year){
message(paste0("Downloading ", year))
# The general format is yearmonth like 201101 for Jan 2011.
month_in <- c("01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12")
year_vec <- rep(year, 12)
url_build <- function(year_vec, month_in){
glue::glue("https://www.ttb.gov/images/pdfs/statistics/{year}/{year}{month_in}beer.pdf")
}
# output to the pdfs folder
download_monthly_pdf <- function(year, month, url_in){
download.file(
url = url_in,
destfile = glue::glue("pdfs/ttb_monthly_stats_{year}-{month}.pdf")
)
}
# build all the input urls and attach to an input dataframe
full_df <- tibble(year = year_vec, month = month_in) %>%
mutate(url_in = pmap_chr(.l = list(year_vec, month_in), .f = url_build))
# The pwalk here takes all 3 inputs and applies them to download_monthly_pdf function
pwalk(full_df, .f = download_monthly_pdf)
}
```
We could apply that function to all the years of interest with another `purrr::walk()` call. This will run `download_monthly_stats_pdf()` for 2011, 2012, 2013, and 2014.
```{r, eval = FALSE}
walk(c(2011:2014), download_monthly_stats_pdf)
```
### List File Names
```{r}
# list all the files we have downloaded so far
all_files <- list.files("pdfs")
length(all_files)
```
We have 48 PDFs, as expected - 12 months x 4 years = 48!
Now let's take a peek inside one of the PDFs.
![](PDF-table.png)
### Raw PDFs
When we run `pdftools::pdf_text()` we can see a decently formatted table. The main issue to consider is that there is a lot erroneous header descriptions, and there are unequal spacing between "columns" in the table. Importantly, each line of the PDF is separated by a newline `\n`. This is key to our strategy for pulling out individual lines.
```{r}
pdftools::pdf_text("pdfs/ttb_monthly_stats_2011-01.pdf")
```
#### Split by row
We can use `stringr::str_split()` to separate the text at each of the `\n` newlines. This generates a list of character strings, we call `unlist()` to extract to a vector. We now have a nicely separated vector of character strings, where each row is a new string.
```{r}
raw_text <- pdftools::pdf_text("pdfs/ttb_monthly_stats_2011-01.pdf") %>%
str_split("\n") %>%
unlist()
raw_text
```
## Build Table
Now that we have the data split into a vector we can start finding "rows" to drop. We can see that the 9th string is actually the column titles, and the table ends at the 36th string. However, this could change according to which PDF we are looking at, so rather than going by position we can use `stringr::str_which()` to match a logical with matched text.
```{r}
# Start of table - column names
raw_text[9]
# End of table - last value
raw_text[36]
```
We get the same "rows" with our matching `str_which()`.
```{r}
# find start of table
stringr::str_which(raw_text, "MANUFACTURE OF BEER")
# find end of table
stringr::str_which(raw_text, "Total Used")
```
Let's actually assign this now, rather than just printing. We can also remove leading/trailing whitespace with `stringr::str_trim()`. When we look at table_trimmed we can "see" a group of text strings that much closer resemble a table!
```{r}
table_start <- stringr::str_which(raw_text, "MANUFACTURE OF BEER")
# End of table (drop all the asterisks and the other descriptors)
table_end <- stringr::str_which(raw_text, "Total Used")
# Trim the table to the start/end and drop whitespace at each line
table_trimmed <- raw_text[table_start:table_end] %>%
str_trim()
table_trimmed
```
### Remove all the extra whitespace
Next we need to remove all the huge whitespaces from between columns. The regular expression (regex) of `"\\s{2,}"` matches whitespaces of 2 or more. If we use `stringr::str_replace_all()` to take all the whitespaces > 2 and replace with a new delimiter such as `"|"` we can move to our next step. While we're at it, let's remove all the commas so that we can go straight to doubles rather than characters for all the beer production variables.
```{r}
# Replace long spaces with a col break symbol
squished_table <- str_replace_all(table_trimmed, "\\s{2,}", "|") %>%
str_remove_all(",")
squished_table
```
### Convert to tibble
Now we have a nicely formatted vector of strings! We can use `tibble::enframe()` to create a dataframe/tibble out of the vector.
```{r}
# Convert to tibble
raw_df <- enframe(squished_table)
raw_df
```
Next we can separate value into the 5 columns. Notice that there are a few "rows" where the data is NA as there were rows that acted only as indicators of the type of beer production. We'll use them later.
```{r}
year <- 2011
month <- "02"
# Convert to tibble
beer_df <- raw_df %>%
separate(value,
into = c("type", "month_current", "month_prior_year", "ytd_current", "ytd_prior_year"),
sep = "\\|") %>%
slice(-1) %>%
mutate_at(vars(month_current:ytd_prior_year), as.double) %>%
mutate(year = as.integer(year), month = as.integer(month)) %>%
select(year, month, type, everything())
beer_df
```
Technically at this point, we have successfully converted from raw text to a dataframe/table/tibble! HOWEVER, for many many examples in the wild you will need to do additional data cleaning, data manipulation, factor assignment, etc. As such, I'll continue working on this to get to a final output. I'll also work on repeating this many times as opposed to one time.
### Proper Cleaning
This is actually two datasets that are combined into one large reporting table. As such we need to identify the specific row/point to split the dataset at. We can filter to just the row that matches either the string "MATERIALS USED" or "IN POUNDS", as that indicates a label starting the 2nd dataset.
```{r}
slice_num <- beer_df %>%
# find a string that has MATERIALS USED or IN POUNDS
# | means OR
filter(str_detect(type, "MATERIALS USED|IN POUNDS")) %>%
pull(name)
slice_num
```
#### Split dataframe
Next we will add a column based on logic for the `slice_num`, and assign a grouping variable for either `Barrels Produced` (dataset 1) or `Pounds of Materials Used` (dataset 2). We can then drop the unneeded rows with a `filter()`, `group_by(`) the newly produced grouping variable, and use `dplyr::group_split()` to separate the combined dataset into a list of both datasets.
```{r}
# split data into materials vs barrels produced
split_df <- beer_df %>%
mutate(data_type = ifelse(name >= slice_num, "Pounds of Materials Used", "Barrels Produced"),
type = str_remove(type, ":")) %>%
select(data_type, everything(), -name) %>%
filter(!str_detect(type, "IN POUNDS|MATERIALS USED|MANUFACTURE OF BEER|BARRELS")) %>%
group_by(data_type) %>%
group_split()
summary(split_df)
```
#### Factor cleaning and final dataframes
We can see that the `split_df` object is a list of 2 tibbles/dataframes. We can now operate on the individual dataframes and finalize the factor cleaning and assignment to make the data a bit tidier and analysis ready.
```{r}
manufacture_df <- split_df[[1]] %>%
mutate(
tax_status = case_when(
type %in% c("In bottles and cans", "In kegs", "In barrels and kegs",
"Tax Determined, Premises Use") ~ "Taxable",
type == "Sub Total Taxable" ~ "Sub Total Taxable",
type %in% c("For export", "For vessels and aircraft",
"Consumed on brewery premises") ~ "Tax Free",
type == "Sub Total Tax-Free" ~ "Sub Total Tax-Free",
type %in% c("Production", "Total Removals",
"Stocks On Hand end-of-month:") ~ "Totals"
),
tax_rate = dplyr::if_else(year <= 2017, "$7/$18 per barrel", "$3.50/$16 per barrel")
) %>%
filter(!is.na(tax_status)) %>%
select(data_type, tax_status, everything())
```
```{r}
# clean up the material dataset
material_df <- split_df[[2]] %>%
mutate(
material_type = case_when(
str_detect(type, "Malt|Corn|Rice|Barley|Wheat") ~ "Grain Products",
str_detect(type, "Sugar|Hops|Other") ~ "Non-Grain Products",
str_detect(type, "Total") ~ type
)
) %>%
select(data_type, material_type, everything())
```
#### Print the dataframes
The manufacture dataframe now has the labels, factors, etc separated into nice columns, with the 4x columns for specific barrels produced.
```{r}
manufacture_df
```
The material dataframe now has the labels, factors, etc separated into nice columns, with the 4x columns for specific pounds of product used.
```{r}
material_df
```
### Finished Cleaning
We have now finished cleaning the manufacting and material dataframes! However, we did this all line-by-line without functions and would need to repeat this for the other 47 PDFs! Let's convert ALL that code into a function that outputs the final dataframes.
## Use a function
```{r}
# create a function that works for most years
get_beer_tables <- function(year, month) {
# read in the raw PDF
raw_text <- pdftools::pdf_text(glue::glue("pdfs/ttb_monthly_stats_{year}-{month}.pdf")) %>%
str_split("\n") %>%
unlist()
## Build Table
# find start of table
table_start <- stringr::str_which(raw_text, "MANUFACTURE OF BEER")
# End of table (drop all the asterisks and the other descriptors)
table_end <- stringr::str_which(raw_text, "Total Used")
# Trim the table to the start/end and drop whitespace at each line
table_trimmed <- raw_text[table_start:table_end] %>%
str_trim()
table_trimmed
### Remove all the extra whitespace
# Replace long spaces with a col break symbol
squished_table <- str_replace_all(table_trimmed, "\\s{2,}", "|") %>%
str_remove_all(",")
### Convert to tibble
# Convert to tibble
raw_df <- enframe(squished_table)
# split the rows into their columns
beer_df <- suppressWarnings(raw_df %>%
separate(value,
into = c("type", "month_current", "month_prior_year", "ytd_current", "ytd_prior_year"),
sep = "\\|"
) %>%
slice(-1) %>%
mutate_at(vars(month_current:ytd_prior_year), as.double) %>%
mutate(year = as.integer(year), month = as.integer(month)) %>%
select(year, month, type, everything()))
### Proper Cleaning
# ID the specific row/point to split the dataset at.
slice_num <- beer_df %>%
# find a string that has MATERIALS USED or IN POUNDS
# | means OR
filter(str_detect(type, "MATERIALS USED|IN POUNDS")) %>%
pull(name)
#### Split dataframe
# split data into materials vs barrels produced
split_df <- suppressWarnings(beer_df %>%
mutate(
data_type = ifelse(name >= slice_num, "Pounds of Materials Used", "Barrels Produced"),
type = str_remove(type, ":")
) %>%
select(data_type, everything(), -name) %>%
filter(!str_detect(type, "IN POUNDS|MATERIALS USED|MANUFACTURE OF BEER|BARRELS")) %>%
group_by(data_type) %>%
group_split())
#### Factor cleaning and final dataframes
# clean manufacture df
manufacture_df <- split_df[[1]] %>%
mutate(
tax_status = case_when(
type %in% c(
"In bottles and cans", "In kegs", "In barrels and kegs",
"Tax Determined, Premises Use"
) ~ "Taxable",
type == "Sub Total Taxable" ~ "Sub Total Taxable",
type %in% c(
"For export", "For vessels and aircraft",
"Consumed on brewery premises"
) ~ "Tax Free",
type == "Sub Total Tax-Free" ~ "Sub Total Tax-Free",
type %in% c(
"Production", "Total Removals",
"Stocks On Hand end-of-month:"
) ~ "Totals"
),
tax_rate = dplyr::if_else(year <= 2017, "$7/$18 per barrel", "$3.50/$16 per barrel")
) %>%
filter(!is.na(tax_status)) %>%
select(data_type, tax_status, everything())
# clean up the material dataset
material_df <- split_df[[2]] %>%
mutate(
material_type = case_when(
str_detect(type, "Malt|Corn|Rice|Barley|Wheat") ~ "Grain Products",
str_detect(type, "Sugar|Hops|Other") ~ "Non-Grain Products",
str_detect(type, "Total") ~ type
)
) %>%
select(data_type, material_type, everything())
# output a list of both dfs
list(manufacture_df, material_df)
}
```
Really the only code we have changed is we added a `glue` call to add the year, month to which PDF to read in, and we have the output as a list of both dataframes. Let's test our function!
```{r}
get_beer_tables(2011, "01")
```
Boom! Function is working for our example, let's try it out with more than 1 input via `purrr`!
## `purrr` - iteration without repetition
We'll be using `pmap()` to apply our function multiple times, where `pmap` can take any number of inputs. For example if we call `get_beer_tables()` via `pmap`, we can get our tables for that 1 year/month combo!
```{r}
# Quick test of purrr
pmap(list(2011, "02"), get_beer_tables)
```
However our goal is all the inputs at once! We can create a vector of the month inputs as character strings, and then use `tidyr::crossing()` to output all the possible combinations of year + month as a dataframe. Notice two columns, year and month with a length of 48 - equal to all of our PDFs!
```{r}
# add the month_num as vector
month_num <- c("01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12")
# use crossing to generate all combos for the data
# 2010 is missing, but as the data has prior year data we can theoretically
# add it back in after the fact
crossing(
year = c(2011:2014),
month = month_num
) %>% glimpse()
```
### All possible combos
We can use `tidyr::crossing()` again to generate the possible inputs and create the output dataframes as list column of two dataframes. Running this takes only about 2 seconds across the 48 PDFs! The output is not very exciting as the data is simply the year & month columns, plus a list-column called data. Let's get the final outputs!
```{r}
# add the month_num as vector
month_num <- c("01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12")
# use crossing to generate all combos for the data
df_2011_2014 <- crossing(year = c(2011:2014),
month = month_num) %>%
mutate(data = pmap(., get_beer_tables))
df_2011_2014
```
#### Final output
We can now get just the output data, drop the other columns. We're still working with list-columns, so let's get to the `manufacture_df` and `material_df`.
```{r}
final_output <- df_2011_2014 %>%
# grab the data into respective columns
mutate(manufacture_data = map(data, 1),
material_data = map(data, 2)) %>%
select(manufacture_data, material_data)
final_output
```
The manufacture dataframe can be combined as below.
```{r}
# Grab just the manufacture data
manufacture_df <- final_output %>%
select(manufacture_data) %>%
unnest(manufacture_data)
# Grab just the manufacture data
material_df <- final_output %>%
select(material_data) %>%
unnest(material_data)
```
And now we can look at the outputs!
#### Manufacture dataset
```{r}
glimpse(manufacture_df)
```
#### Material dataset
```{r}
glimpse(material_df)
```
# Do it all in 6 Lines of Code!
Now all of that could have been done in about 6 lines of `tidyverse` code since we created a function.
```{r}
# Use crossing to generate all 48 combos for the data
# Use purrr to read in, clean, and output the 96 tables from the 48 PDFs
final_output <- crossing(year = c(2011:2014), month = month_num) %>%
mutate(data = pmap(., get_beer_tables)) %>%
mutate(manufacture_data = map(data, 1), material_data = map(data, 2)) %>%
select(manufacture_data, material_data)
# Grab just the manufacture data
manufacture_df <- final_output %>% select(manufacture_data) %>% unnest(manufacture_data)
# Grab just the manufacture data
material_df <- final_output %>% select(material_data) %>% unnest(material_data)
```