# Extract data encoded in characters 

In [1]:
library('tidyverse')

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.2.1     ✔ purrr   0.3.3
✔ tibble  2.1.3     ✔ dplyr   0.8.3
✔ tidyr   1.0.0     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.4.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


It's common for values (or column names) to include multiple pieces of information.

E.g. `incidents_85_99` encodes both an incident type and a year range.

In [2]:
# ?extract

Use regular expression syntax to describe how the data is encoded.

- `.` (dot) = any single character
- `*` = match the preceding expression zero or more times
- `+` = match the preceding expression one or more times

Parentheses mark the groups to extract; each group becomes one new column.
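A minimal sketch of these pieces, using `str_detect()` and `str_match()` from stringr (loaded with the tidyverse). The test strings are made up for illustration and the variable names are arbitrary.

```r
library(stringr)

# Illustrative strings only
s <- c("ab", "abc", "abcc")

# `c*` allows zero or more "c", so "ab" already matches
star_hits <- str_detect(s, "abc*")   # TRUE TRUE TRUE

# `c+` needs at least one "c", so "ab" does not match
plus_hits <- str_detect(s, "abc+")   # FALSE TRUE TRUE

# Each pair of parentheses captures one piece of the string.
# str_match() returns a character matrix: column 1 is the whole match,
# the following columns are the captured groups.
m <- str_match("incidents_85_99", "(.+)_(.+)_(.+)")
m[1, 2:4]                            # "incidents" "85" "99"
```

`extract()` uses capture groups the same way, writing each group to its own column.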

In [3]:
iris %>% head

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
<dbl>,<dbl>,<dbl>,<dbl>,<fct>
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
5.4,3.9,1.7,0.4,setosa


`Sepal.Length`

`<part of plant>.<measurement>`

`"(.+)\\.(.+)"`

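`\\.` is a "backslash escape": it matches a literal dot instead of treating `.` as the "any character" pattern. The backslash is doubled because the pattern is written inside an R string.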
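To see what the escape changes, here is a small sketch comparing an unescaped and an escaped dot. `SepalXLength` is a made-up name, included only to show the difference.

```r
library(stringr)

# The second name is hypothetical; it has an "X" where the dot would be
nm <- c("Sepal.Length", "SepalXLength")

# Unescaped: `.` matches any character, so both names match
unescaped <- str_detect(nm, "Sepal.Length")     # TRUE TRUE

# Escaped: `\\.` matches only a literal dot
escaped <- str_detect(nm, "Sepal\\.Length")     # TRUE FALSE
```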

In [4]:
iris %>%
    as_tibble %>%
    pivot_longer(-one_of('Species'), names_to = 'name', values_to = 'value') %>%
    extract(name, c('part', 'meas'), "(.+)\\.(.+)") %>%
    print

# A tibble: 600 x 4
   Species part  meas   value
   <fct>   <chr> <chr>  <dbl>
 1 setosa  Sepal Length   5.1
 2 setosa  Sepal Width    3.5
 3 setosa  Petal Length   1.4
 4 setosa  Petal Width    0.2
 5 setosa  Sepal Length   4.9
 6 setosa  Sepal Width    3
 7 setosa  Petal Length   1.4
 8 setosa  Petal Width    0.2
 9 setosa  Sepal Length   4.7
10 setosa  Sepal Width    3.2
# … with 590 more rows


What about this?

`incidents_85_99`

In [5]:
df = data.frame(x = c('incidents_85_99', 'incidents_00_14', 'fatalities_00_14'))
df %>% extract(x, c('accident_type', 'year_range'), '(.*)_(.*_.*)')

accident_type,year_range
<chr>,<chr>
incidents,85_99
incidents,00_14
fatalities,00_14
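As a variation, the year range could itself be split into two columns by using three groups instead of two. This is only a sketch: the column names `start_year` and `end_year` are illustrative choices, not part of the original data.

```r
library(tidyr)

df <- data.frame(
    x = c("incidents_85_99", "incidents_00_14", "fatalities_00_14"),
    stringsAsFactors = FALSE
)

# Three groups separated by underscores give three output columns.
# The extracted values are character strings, so "00" keeps its leading zero.
out <- extract(df, x, c("accident_type", "start_year", "end_year"), "(.*)_(.*)_(.*)")
out
```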
