forked from stephaniehicks/jhustatcomputing2022
-
Notifications
You must be signed in to change notification settings - Fork 4
/
index.qmd
744 lines (496 loc) · 21.7 KB
/
index.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
---
title: "21 - Regular expressions"
author:
- name: Leonardo Collado Torres
url: http://lcolladotor.github.io/
affiliations:
- id: libd
name: Lieber Institute for Brain Development
url: https://libd.org/
- id: jhsph
name: Johns Hopkins Bloomberg School of Public Health Department of Biostatistics
url: https://publichealth.jhu.edu/departments/biostatistics
description: "Introduction to working with character strings and regular expressions in R"
categories: [module 5, week 6, tidyverse, R, programming, strings and regex]
---
*This lecture, as the rest of the course, is adapted from the version [Stephanie C. Hicks](https://www.stephaniehicks.com/) designed and maintained in 2021 and 2022. Check the recent changes to this file through the `r paste0("[GitHub history](https://github.com/lcolladotor/jhustatcomputing2023/commits/main/", basename(dirname(getwd())), "/", basename(getwd()), "/index.qmd)")`.*
<!-- Add interesting quote -->
# Pre-lecture materials
### Read ahead
::: callout-note
## Read ahead
**Before class, you can prepare by reading the following materials:**
1. <https://r4ds.had.co.nz/strings>
:::
### Acknowledgements
Material for this lecture was borrowed and adopted from
- <https://rdpeng.github.io/Biostat776/lecture-regular-expressions>
- <https://r4ds.had.co.nz/strings>
# Learning objectives
::: callout-note
# Learning objectives
**At the end of this lesson you will:**
- Understand what is a 'regular expression' and how to create one
- Learn the basics of searching for patterns in character strings in base R and the `stringr` R package in the `tidyverse`
- Use the built in character sets to search for patterns in strings including `"\n"`, `"\t"`, `"\w"`, `"\d"`, and `"\s"`
:::
# Introduction
## regex basics
A **regular expression** (also known as a "regex" or "regexp") is a concise language for describing patterns in character strings.
Regex could be **patterns that could be contained within another string**.
::: callout-tip
### Example
For example, if we wanted to search for the pattern "ai" in the character string "The rain in Spain", we see it appears twice!
"The r**ai**n in Sp**ai**n"
:::
Generally, a regular expression can be used for e.g.
- **searching for a pattern or string** within another string (e.g searching for the string "a" in the string "Maryland")
- **replacing one part of a string** with another string (e.g replacing the string "t" with "p" in the string "hot" where you are changing the string "hot" to "hop")
If you have never worked with regular expressions, it can seem like maybe a baby hit the keys on your keyboard (complete gibberish), but it will slowly make sense once you learn the syntax.
Soon you will **be able create incredibly powerful regular expressions** in your day-to-day work.
## string basics
In R, you can **create (character) strings** with either single quotes (`'hello!'`) or double quotes (`"hello!"`) -- no difference (not true for other languages!).
I **recommend using the double quotes**, unless you want to create a string with multiple `"`.
```{r}
string1 <- "This is a string"
string2 <- 'If I want to include a "quote" inside a string, I use single quotes'
```
::: callout-tip
### Pro-tip
Strings can be tricky when executing them. If you forget to close a quote, you'll see `+`
``` r
> "This is a string without a closing quote
+
+
+ HELP I'M STUCK
```
If this happen to you, take a deep breath, press `Escape` and try again.
:::
Multiple strings are often stored in a character vector, which you can create with `c()`:
```{r}
c("one", "two", "three")
```
# `grepl()`
One of the **most basic functions in R that uses regular expressions** is the `grepl(pattern, x)` function, which takes **two arguments** and **returns a logical**:
1. A regular expression (`pattern`)
2. A string to be searched (`x`)
In case you are curious, "grepl" literally translates to "grep logical".
If the string (`x`) contains the specified regular expression (`pattern`), then `grepl()` will return `TRUE`, otherwise it will return `FALSE`.
Let's take a look at one example:
```{r}
regular_expression <- "a"
string_to_search <- "Maryland"
grepl(pattern = regular_expression, x = string_to_search)
```
In the example above, we specify the regular expression `"a"` and store it in a variable called `regular_expression`.
::: callout-tip
### Note
**Remember** that regular expressions are just strings!
:::
We also store the string `"Maryland"` in a variable called `string_to_search`.
The regular expression `"a"` represents a single occurrence of the character `"a"`. Since `"a"` is contained within `"Maryland"`, `grepl()` returns the value `TRUE`.
::: callout-tip
### Example
Let's try another simple example:
```{r}
regular_expression <- "u"
string_to_search <- "Maryland"
grepl(pattern = regular_expression, x = string_to_search)
```
The regular expression `"u"` represents a single occurrence of the character `"u"`, which is not a sub-string of `"Maryland"`, therefore `grepl()` returns the value `FALSE`.
:::
Regular expressions can be much longer than single characters. You could for example search for smaller strings inside of a larger string:
```{r}
grepl("land", "Maryland")
grepl("ryla", "Maryland")
grepl("Marly", "Maryland")
grepl("dany", "Maryland")
```
Since `"land"` and `"ryla"` are sub-strings of `"Maryland"`, `grepl()` returns `TRUE`, however when a regular expression like `"Marly"` or `"dany"` is searched `grepl()` returns `FALSE` because neither are sub-strings of `"Maryland"`.
::: callout-tip
### Introduce the US states dataset `state.name`
There is a dataset that comes with R called `state.name` which is a vector of strings, one for each state in the United States of America.
We are going to use this vector in several of the following examples.
```{r}
head(state.name)
length(state.name)
```
:::
Next, we will build a **regular expression for identifying several strings in this vector** of character strings, specifically a regular expression that will match names of states that both start and end with a vowel.
The state name could start and end with any vowel, so we will not be able to match exact sub-strings like in the previous examples. Thankfully we can use **metacharacters** to look for vowels and other parts of strings.
## metacharacters
The first metacharacter that we will discuss is `"."`.
The metacharacter that only consists of a period **represents any character other than a new line** (we will discuss new lines soon).
::: callout-tip
### Example
Let's take a look at some examples using the period regex:
```{r}
grepl(".", "Maryland")
grepl(".", "*&2[0+,%<@#~|}")
grepl(".", "")
```
:::
As you can see the **period metacharacter is very liberal**.
This metacharacter is **most useful when you do not care about a set of characters** in a regular expression.
::: callout-tip
### Example
Here is another example
```{r}
grepl("a.b", c("aaa", "aab", "abb", "acadb"))
```
In the case above, `grepl()` returns `TRUE` for all strings that contain an `a` followed by any other character followed by a `b`.
:::
## repetition
You can specify a regular expression that **contains a certain number of characters or metacharacters** using the **enumeration metacharacters** (or sometimes called **quantifiers**).
- `+`: indicates that **one or more of the preceding expression** should be present (or matches at least 1 time)
- `*`: indicates that **zero or more of the preceding expression** is present (or matches at least 0 times)
- `?`: indicates that **zero or 1 of the preceding expression is not present or present at most 1 time** (or matches between 0 and 1 times)
::: callout-tip
### Example
Let's take a look at some examples using these metacharacters:
```{r}
# Does "Maryland" contain one or more of "a" ?
grepl("a+", "Maryland")
# Does "Maryland" contain one or more of "x" ?
grepl("x+", "Maryland")
# Does "Maryland" contain zero or more of "x" ?
grepl("x*", "Maryland")
```
:::
If you want to do more than one character, you need to wrap it in `()`.
```{r}
# Does "Maryland" contain zero or more of "x" ?
grepl("(xx)*", "Maryland")
```
::: callout-note
### Question
Let's practice a few out together. Make the following regular expressions for the character string "spookyhalloween":
1. Does "zz" appear 1 or more times?
2. Does "ee" appear 1 or more times?
3. Does "oo" appear 0 or more times?
4. Does "ii" appear 0 or more times?
```{r}
## try it out
```
:::
You can also **specify exact numbers of expressions** using curly brackets `{}`.
- `{n}`: exactly n
- `{n,}`: n or more
- `{,m}`: at most m
- `{n,m}`: between n and m
For example `"a{5}"` specifies "a exactly five times", `"a{2,5}"` specifies "a between 2 and 5 times," and `"a{2,}"` specifies "a at least 2 times." Let's take a look at some examples:
```{r}
# Does "Mississippi" contain exactly 2 adjacent "s" ?
grepl("s{2}", "Mississippi")
# This is equivalent to the expression above:
grepl("ss", "Mississippi")
# Does "Mississippi" contain between 1 and 3 adjacent "s" ?
grepl("s{1,3}", "Mississippi")
# Does "Mississippi" contain between 2 and 3 adjacent "i" ?
grepl("i{2,3}", "Mississippi")
# Does "Mississippi" contain between 2 adjacent "iss" ?
grepl("(iss){2}", "Mississippi")
# Does "Mississippi" contain between 2 adjacent "ss" ?
grepl("(ss){2}", "Mississippi")
# Does "Mississippi" contain the pattern of an "i" followed by
# 2 of any character, with that pattern repeated three times adjacently?
grepl("(i.{2}){3}", "Mississippi")
```
::: callout-note
### Question
Let's practice a few out together. Make the following regular expressions for the character string "spookyspookyhalloweenspookyspookyhalloween":
1. Search for "spooky" exactly 2 times. What about 3 times?
2. Search for "spooky" exactly 2 times followed by any character of length 9 (i.e. "halloween").
3. Same search as above, but search for that twice in a row.
4. Same search as above, but search for that three times in a row.
```{r}
## try it out
```
:::
## capture group
In the examples above, I used parentheses `()` to create a **capturing group**. A capturing group allows you to use quantifiers on other regular expressions.
In the "Mississippi" example, I first created the regex `"i.{2}"` which matches `i` followed by any two characters ("iss" or "ipp"). Then, I used a capture group to wrap that regex, and to specify exactly three adjacent occurrences of that regex.
You can specify **sets of characters** (or character sets or character classes) with regular expressions, some of which come built in, but you can build your own **character sets** too.
More on character sets next.
## character sets
First, we will discuss the built in **character sets**:
- words (`"\\w"`) = **Words** specify **any letter, digit, or a underscore**
- digits (`"\\d"`) = **Digits** specify the **digits 0 through 9**
- whitespace characters (`"\\s"`) = **Whitespace** specifies **line breaks, tabs, or spaces**
Each of these character sets have their own **compliments**:
- not words (`"\\W"`)
- not digits (`"\\D"`)
- not whitespace characters (`"\\S"`)
Each specifies all of the characters not included in their corresponding character sets.
::: callout-tip
### Interesting fact
Technically, you are using the a character set `"\d"` or `"\s"` (with only one black slash), but because you are using this character set in a string, you need the second `\` to escape the string. So you will type `"\\d"` or `"\\s"`.
```{r}
#| error: true
"\\d"
```
So for example, to include a literal single or double quote in a string you can use `\` to "escape" the string and being able to include a single or double quote:
```{r}
double_quote <- "\""
double_quote
single_quote <- "'"
single_quote
```
That means if you want to include a literal backslash, you will need to double it up: `"\\"`.
:::
In fact, putting **two backslashes before any punctuation mark that is also a metacharacter** indicates that you are **looking for the symbol and not the metacharacter meaning**.
For example `"\\."` indicates you are trying to match a period in a string. Let's take a look at a few examples:
```{r}
grepl("\\+", "tragedy + time = humor")
grepl("\\.", "https://publichealth.jhu.edu")
```
::: callout-tip
### Beware
The printed representation of a string is not the same as string itself, because the printed representation shows the escapes. To see the raw contents of the string, use `writeLines()`:
```{r}
x <- c("\'", "\"", "\\")
x
writeLines(x)
```
:::
There are a handful of **other special characters**. The most common are
- `"\n"`: newline
- `"\t"`: tab,
but you can see the complete list by requesting help (run the following in the console and a help file will appear:
```{r}
#| eval: false
?"'"
```
You will also sometimes see strings like "\u00b5", this is a way of writing non-English characters that works on all platforms:
```{r}
x <- c("\\t", "\\n", "\u00b5")
x
writeLines(x)
```
::: callout-tip
### Example
Let's take a look at a few examples of built in character sets: `"\w"`, `"\d"`, `"\s"`.
```{r}
grepl("\\w", "abcdefghijklmnopqrstuvwxyz0123456789")
grepl("\\d", "0123456789")
# "\n" is the metacharacter for a new line
# "\t" is the metacharacter for a tab
grepl("\\s", "\n\t ")
grepl("\\d", "abcdefghijklmnopqrstuvwxyz")
grepl("\\D", "abcdefghijklmnopqrstuvwxyz")
grepl("\\w", "\n\t ")
```
:::
## brackets
You can also **specify specific character sets** using **straight brackets** `[]`.
For example a **character set of just the vowels** would look like: `"[aeiou]"`.
```{r}
grepl("[aeiou]", "rhythms")
```
You can find the complement to a specific character by putting a carrot `^` after the first bracket. For example `"[^aeiou]"` matches all characters except the lowercase vowels.
```{r}
grepl("[^aeiou]", "rhythms")
```
## ranges
You can also **specify ranges of characters** using a **hyphen** `-` inside of the brackets.
For example:
- `"[a-m]"` matches all of the lowercase characters between `a` and `m`
- `"[5-8]"` matches any digit between 5 and 8 inclusive
::: callout-tip
### Example
Let's take a look at some examples using custom character sets:
```{r}
grepl("[a-m]", "xyz")
grepl("[a-m]", "ABC")
grepl("[a-mA-M]", "ABC")
```
:::
## beginning and end
There are also metacharacters for **matching the beginning** and **the end of a string** which are `"^"` and `"$"` respectively.
Let's take a look at a few examples:
```{r}
grepl("^a", c("bab", "aab"))
grepl("b$", c("bab", "aab"))
grepl("^[ab]*$", c("bab", "aab", "abc"))
```
## OR metacharacter
The last metacharacter we will discuss is the **OR metacharacter** (`"|"`).
The OR metacharacter **matches either the regex on the left or the regex on the right** side of this character. A few examples:
```{r}
grepl("a|b", c("abc", "bcd", "cde"))
grepl("North|South", c("South Dakota", "North Carolina", "West Virginia"))
```
## `state.name` example
::: callout-tip
### Example
Finally, we have learned enough to create a regular expression that matches all state names that both begin and end with a vowel:
1. We match the beginning of a string.
2. We create a character set of just capitalized vowels.
3. We specify one instance of that set.
4. Then any number of characters until:
5. A character set of just lowercase vowels.
6. We specify one instance of that set.
7. We match the end of a string.
```{r}
start_end_vowel <- "^[AEIOU]{1}.+[aeiou]{1}$"
vowel_state_lgl <- grepl(start_end_vowel, state.name)
head(vowel_state_lgl)
state.name[vowel_state_lgl]
```
:::
Below is a table of several important metacharacters:
```{r}
#| echo: false
library(knitr)
mc_tibl <- data.frame(
Metacharacter =
c(
".", "\\\\w", "\\\\W", "\\\\d", "\\\\D",
"\\\\s", "\\\\S", "[xyz]", "[^xyz]", "[a-z]",
"^", "$", "\\\\n", "+", "*", "?", "|", "{5}", "{2, 5}",
"{2, }"
),
Meaning =
c(
"Any Character", "A Word", "Not a Word", "A Digit", "Not a Digit",
"Whitespace", "Not Whitespace", "A Set of Characters",
"Negation of Set", "A Range of Characters",
"Beginning of String", "End of String", "Newline",
"One or More of Previous", "Zero or More of Previous",
"Zero or One of Previous", "Either the Previous or the Following",
"Exactly 5 of Previous", "Between 2 and 5 or Previous",
"More than 2 of Previous"
),
stringsAsFactors = FALSE
)
kable(mc_tibl, align = "c")
```
# Other regex in base R
So far we've been using `grepl()` to see if a regex matches a string. There are a few other built in regex functions you should be aware of.
First, we will review our workhorse of this lesson, `grepl()`, which stands for "grep logical."
```{r}
grepl("[Ii]", c("Hawaii", "Illinois", "Kentucky"))
```
## `grep()`
Then, there is old fashioned `grep(pattern, x)`, which **returns the indices of the vector** that match the regex:
```{r}
grep(pattern = "[Ii]", x = c("Hawaii", "Illinois", "Kentucky"))
```
## `sub()`
The `sub(pattern, replacement, x)` function takes as arguments a regex, a "replacement," and a vector of strings. This function will **replace the first instance of that regex found in each string**.
```{r}
sub(pattern = "[Ii]", replacement = "1", x = c("Hawaii", "Illinois", "Kentucky"))
```
## `gsub()`
The `gsub(pattern, replacement, x)` function is nearly the same as `sub()` except it will **replace every instance of the regex** that is matched in each string.
```{r}
gsub("[Ii]", "1", c("Hawaii", "Illinois", "Kentucky"))
```
## `strsplit()`
The `strsplit(x, split)` function will **split up strings** (`split`) according to the provided regex (`x`) .
If `strsplit()` is provided with a vector of strings it will return a list of string vectors.
```{r}
two_s <- state.name[grep("ss", state.name)]
two_s
strsplit(x = two_s, split = "ss")
```
# The stringr package
The [`stringr`](https://github.com/hadley/stringr) package, written by [Hadley Wickham](http://hadley.nz/), is part of the [Tidyverse](https://twitter.com/hadleywickham/status/751805589425000450) group of R packages.
This package **takes a "data first" approach to functions involving regex**, so usually the string is the first argument and the regex is the second argument.
The majority of the function names in `stringr` **begin** with `str_*()`.
![](https://raw.githubusercontent.com/rstudio/cheatsheets/master/pngs/thumbnails/strings-cheatsheet-thumbs.png){preview="TRUE"}
\[**Source**: <https://stringr.tidyverse.org> \]
## `str_extract`
The `str_extract(string, pattern)` function returns the sub-string of a string (`string`) that matches the provided regular expression (`pattern`).
```{r}
library(stringr)
state_tbl <- paste(state.name, state.area, state.abb)
head(state_tbl)
str_extract(state_tbl, "[0-9]+")
```
## `str_detect`
The `str_detect(string, pattern)` is equivalent to `grepl(pattern,x)`:
```{r}
str_detect(state_tbl, "[0-9]+")
grepl("[0-9]+", state_tbl)
```
It detects the presence or absence of a pattern in a string.
## `str_order`
The `str_order(x)` function returns a numeric vector that corresponds to the alphabetical order of the strings in the provided vector (`x`).
```{r}
head(state.name)
str_order(state.name)
head(state.abb)
str_order(state.abb)
```
## `str_replace`
The `str_replace(string, pattern, replace)` is equivalent to `sub(pattern, replacement, x)`:
```{r}
str_replace(string = state.name, pattern = "[Aa]", replace = "B")
sub(pattern = "[Aa]", replacement = "B", x = state.name)
```
## `str_pad`
The `str_pad(string, width, side, pad)` function pads strings (`string`) with other characters, which is often useful when the string is going to be eventually printed for a person to read.
```{r}
str_pad("Thai", width = 8, side = "left", pad = "-")
str_pad("Thai", width = 8, side = "right", pad = "-")
str_pad("Thai", width = 8, side = "both", pad = "-")
```
The `str_to_title(string)` function acts just like `tolower()` and `toupper()` except it puts strings into Title Case.
```{r}
cases <- c("CAPS", "low", "Title")
str_to_title(cases)
```
## `str_trim`
The `str_trim(string)` function deletes white space from both sides of a string.
```{r}
to_trim <- c(" space", "the ", " final frontier ")
str_trim(to_trim)
```
## `str_wrap`
The `str_wrap(string)` function inserts newlines in strings so that when the string is printed each line's length is limited.
```{r}
pasted_states <- paste(state.name[1:20], collapse = " ")
cat(str_wrap(pasted_states, width = 80))
cat(str_wrap(pasted_states, width = 30))
```
## `word`
The `word()` function allows you to index each word in a string as if it were a vector.
```{r}
a_tale <- "It was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness"
word(a_tale, 2)
word(a_tale, end = 3) # end = last word to extract
word(a_tale, start = 11, end = 15) # start = first word to extract
```
# Post-lecture materials
### Final Questions
Here are some post-lecture questions to help you think about the material discussed.
::: callout-note
### Questions
There is a corpus of common words here:
```{r}
head(stringr::words)
length(stringr::words)
```
1. Using `stringr::words`, create regular expressions that find all words that:
- Start with "y".
- End with "x"
- Are exactly three letters long. (Don't cheat by using str_length()!)
- Have seven letters or more.
2. Using the same `stringr::words`, create regular expressions to find all words that:
- Start with a vowel.
- That only contain consonants. (Hint: thinking about matching "not"-vowels.)
- End with `ed`, but not with `eed`.
- End with `ing` or `ise`.
:::
### Additional Resources
::: callout-tip
- <https://stringr.tidyverse.org>
- <https://rdpeng.github.io/Biostat776/lecture-regular-expressions>
- <https://r4ds.had.co.nz/strings>
:::
# R session information
```{r}
options(width = 120)
sessioninfo::session_info()
```