In [1]:
library(tidyverse)
remotes::install_github("bradleyboehmke/harrypotter")
library(harrypotter)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# Lecture 14: Regular expressions

<div style="border: 1px double black; padding: 10px; margin: 10px">

**After today's lecture you will:**
* Understand basic regular expressions.
* Use regular expressions to extract data from text.
</div>

These notes correspond to Chapter 16 of your book.


## Exercise
What is the last sentence of `ch1`? Hint: use `str_sub` to get the last 200 characters, use `str_locate` to find the position of a literal `.` in that substring (for example with `fixed(".")`, since a bare `.` is a regular expression wildcard), and then take a substring again starting from that position.

In [2]:
ch1 = philosophers_stone[1]

#### Revisiting escape sequence `\`

`print` shows the escape sequence itself instead of displaying an actual newline on the console:

In [None]:
print('abc \n abc')

[1] "abc \n abc"


Using the `cat` function, however, gives us the newline:

In [4]:
cat('abc \nabc')

abc 
abc

We can also use `writeLines`, which likewise interprets escape sequences (here a tab, `\t`):

In [6]:
writeLines('abc \tabc')

abc 	abc


#### Raw strings to the rescue

Escaping a string that itself contains many quotes and backslashes quickly becomes confusing. Consider this example, where the string holds two lines of R code:

In [None]:
tricky <- "double_quote <- \"\\\"\" # or '\"'
single_quote <- '\\\'' # or \"'\""

cat(tricky)

double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'"

In [None]:
simple <- r"(double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'")"
cat(simple)

double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'"
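
As a quick check, a raw string and its escaped spelling are the same string once R has parsed them. This is a minimal sketch that assumes R 4.0 or later, where the `r"(...)"` syntax is available; the variable names and the example path are made up for illustration.

```r
# Two spellings of the same string: escaped quotes vs. a raw string
escaped <- "He said \"hi\""
raw     <- r"(He said "hi")"
identical(escaped, raw)   # TRUE

# Backslashes are taken literally inside a raw string
path <- r"(C:\Users\me)"
writeLines(path)          # prints C:\Users\me
```

Raw strings only change how the string is typed in your source code; the value that R stores is identical.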

## Regular expressions
Regular expressions (regex, regexps) are a programming language that allows you to describe patterns in strings. They have a steep learning curve but are very powerful for working with text data. In this class we will just focus on the basics of regexps. A good tool for learning regexps is [regex101](https://regex101.com/), which lets you interactively edit and debug your regular expressions.

> Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
>
> — Jamie Zawinski (famous nerd)

In these slides we will use the command `str_view` to understand how regular expressions work.

The most basic regular expression is a plain string. It will match if the other string contains it as a substring.

In [None]:
x = c("apple", "banana", "pear") %>% print
str_view(x, pattern = "an")

[1] "apple"  "banana" "pear"  


[2] │ b<an><an>a

Here `str_view` has matched our regexp (`"an"`) inside of the second string `banana` of the vector `x`.

In [None]:
fruit %>% print

 [1] "apple"             "apricot"           "avocado"          
 [4] "banana"            "bell pepper"       "bilberry"         
 [7] "blackberry"        "blackcurrant"      "blood orange"     
[10] "blueberry"         "boysenberry"       "breadfruit"       
[13] "canary melon"      "cantaloupe"        "cherimoya"        
[16] "cherry"            "chili pepper"      "clementine"       
[19] "cloudberry"        "coconut"           "cranberry"        
[22] "cucumber"          "currant"           "damson"           
[25] "date"              "dragonfruit"       "durian"           
[28] "eggplant"          "elderberry"        "feijoa"           
[31] "fig"               "goji berry"        "gooseberry"       
[34] "grape"             "grapefruit"        "guava"            
[37] "honeydew"          "huckleberry"       "jackfruit"        
[40] "jambul"            "jujube"            "kiwi fruit"       
[43] "kumquat"           "lemon"             "lime"             
[46] "loquat"            

In [7]:
str_view(fruit, 'berry')

 [6] │ bil<berry>
 [7] │ black<berry>
[10] │ blue<berry>
[11] │ boysen<berry>
[19] │ cloud<berry>
[21] │ cran<berry>
[29] │ elder<berry>
[32] │ goji <berry>
[33] │ goose<berry>
[38] │ huckle<berry>
[50] │ mul<berry>
[70] │ rasp<berry>
[73] │ salal <berry>
[76] │ straw<berry>

### Wildcards
Our first non-trivial regular expression will use a wildcard: `.`. Used inside of a regular expression, the period matches any single character:

In [9]:
str_view("else every eele etcetera", "e..e ")

[1] │ <else >every <eele >etcetera

If we want to "extract" the first match we can use `str_extract()` instead:

In [12]:
str_extract("else every eele etcetera", "e..e ")

What about all the matches? Use `str_extract_all()`, which returns every match instead of only the first:

In [None]:
str_extract_all("else every eele etcetera", "e..e")
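
To make the difference between the two functions concrete, here is a small sketch on the same sentence. It assumes only that stringr is loaded; the variable names are illustrative.

```r
library(stringr)

s <- "else every eele etcetera"

# str_extract() returns only the first match for each input string
first <- str_extract(s, "e..e")
first             # "else"

# str_extract_all() returns a list with one character vector per input string
all_matches <- str_extract_all(s, "e..e")
all_matches[[1]]  # "else" "eele" "etce"
```

Note that `every` contributes no match: no `e` in it is followed by two characters and then another `e`.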

Now let's return to another example using harrypotter: finding all capitalized words.

### Exercise
What is the first string that matches the pattern `H<any three characters>y` in Philosopher's Stone?

In [15]:
# H<3>y
str_extract_all(ch1, 'H...y')


### Character classes
Instead of matching anything using `.`, we often want to match a class of things: words, numbers, spaces, etc.
A "character class" is a special pattern that matches a collection of characters. There are three built-in character classes you should know, plus one related escape:
- `\w`: matches any word character; for ASCII text this is equivalent to `[A-Za-z0-9_]`.
- `\s`: matches a single whitespace character (space, tab, newline, and so on); roughly equivalent to `[ \t\n\r\f\v]`.
- `\d`: matches any digit; for ASCII text this is equivalent to `[0-9]`.
- `\b`: matches a "word boundary" (more on this in a moment). Strictly speaking this is a zero-width position, not a character class.



`\w` matches any word character:

In [18]:
str_view("this is a word", "\\w")

[1] │ <t><h><i><s> <i><s> <a> <w><o><r><d>

Note the additional level of escaping needed here: "\\w" gets parsed by R into the string `\w`:

In [19]:
writeLines("\\w")

\w


The `\w` is then parsed again by the regular expression.
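
The same two-step parsing applies to any regular expression escape. As a hedged sketch (the version string `"v1.2.3"` is just an illustrative input), matching a literal period needs the regex `\.`, which is typed in R as `"\\."`:

```r
library(stringr)

nchar("\\w")   # 2: R stores a backslash followed by "w"

# Escaped: the regex \. matches only literal periods
dots <- str_extract_all("v1.2.3", "\\.")[[1]]
dots           # "." "."

# Unescaped: . is a wildcard, so every character matches
any_char <- str_extract_all("v1.2.3", ".")[[1]]
length(any_char)  # 6
```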

If you leave out one of the backslashes, R rejects the string before the regular expression engine ever sees it:

In [20]:
writeLines("\w")

The string "\w" is not valid in R, because there is no escape code "\w":

```
> "\w"
Error: '\w' is an unrecognized escape in character string starting ""\w"
```

In [None]:
# raw representation
str_view("this is a word", r"(\w)")

[1] │ <t><h><i><s> <i><s> <a> <w><o><r><d>

`\d` will match any digit:

In [21]:
str_view(c("number1", "two", "3hree"), "\\d")

[1] │ number<1>
[3] │ <3>hree

Similarly, `\s` will match whitespace (spaces, tabs and newlines):

In [22]:
y = c("spa ce", "hello\tworld", "multi\nline")
writeLines(y)
str_view(y, "\\s")

spa ce
hello	world
multi
line


[1] │ spa< >ce
[2] │ hello<{\t}>world
[3] │ multi<
    │ >line

You can also create your own character class using square brackets: `[abc]` will match *one of* `a`, `b`, or `c`. In other words, the 'width' of a character class is one character by default.

In [23]:
str_view(fruit, '[be]a')  # Match either 'b' or 'e' followed by a

 [4] │ <ba>nana
[12] │ br<ea>dfruit
[58] │ p<ea>ch
[59] │ p<ea>r
[62] │ pin<ea>pple

We can use character classes to match the first capital letter of a capitalized word:

In [24]:
str_view(c("These", "are", "some Capitalized words"),
         "[ABCDEFGHIJKLMNOPQRSTUVWXYZ]")

[1] │ <T>hese
[3] │ some <C>apitalized words

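We do not need to go to all the trouble of typing each capital letter: the shortcut `[A-Z]` matches the same set. A range can cover any run of consecutive characters, so `[a-e]` matches any one of the lowercase letters `a` through `e`.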
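
As a quick check of the `[A-Z]` shortcut, the sketch below uses `str_extract_all` in place of `str_view` so that the result can be inspected as ordinary R values. The string `"Room 42B"` is made up for illustration.

```r
library(stringr)

x <- c("These", "are", "some Capitalized words")

# [A-Z] matches any single uppercase ASCII letter
caps <- str_extract_all(x, "[A-Z]")
caps   # "T" for the first string, no match for the second, "C" for the third

# Several ranges can be combined inside one class: letters or digits
alnum <- str_extract_all("Room 42B", "[A-Za-z0-9]+")[[1]]
alnum  # "Room" "42B"
```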

In [27]:
str_view(c("These", "are", "some Capitalized words"), "[a-e]")

[1] │ Th<e>s<e>
[2] │ <a>r<e>
[3] │ som<e> C<a>pit<a>liz<e><d> wor<d>s

### Word boundaries
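A final escape we'll use frequently is `\b`, which stands for "word boundary". It matches the zero-width position at the "edges" of a word, that is, between a word character and a non-word character (or the start or end of the string):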

In [None]:
str_view(c("Rafael Nadal", "Roger Federer", "Novak Djokovic"), "\\b")

[1] │ <>Rafael<> <>Nadal<>
[2] │ <>Roger<> <>Federer<>
[3] │ <>Novak<> <>Djokovic<>

Every word has a word boundary on either side, so we can use this in combination with other character classes to match certain kinds of words in text.
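
A small sketch of why boundaries matter when counting words; the sentence is invented for illustration and is not taken from the book.

```r
library(stringr)

s <- "the other theme is there, then the end"

# Without boundaries, "the" also matches inside longer words
n_substring <- str_count(s, "the")
n_substring   # 6

# With boundaries on both sides, only the standalone word "the" is counted
n_word <- str_count(s, "\\bthe\\b")
n_word        # 2
```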

## 🤔 Quiz

About how many words start with 'H' in `ch1`?

<ol style="list-style-type: upper-alpha;">
    <li>98</li>
    <li>100</li>
    <li>99</li>
    <li>1</li>
</ol>


In [31]:
#words starting with 'H'
str_count(ch1, '\\bH.')


In this exercise, we matched the pattern

    <word boundary><character H><any character (.)>

## Quantifiers
Now we can return to a question that we asked in the previous lecture: how many words are there in `ch1`? We made a crude approximation by counting the number of spaces, but we saw that this was not quite accurate when we used `\\w+` to find words. Let us now look at this second expression in detail:

            \\w+ - <one or more word characters in a row>

The four quantifiers you should know are listed below. Each one applies to the preceding item, which can be a single character, a character class, or (as we will see) a group:
- `?`: match zero or one of the preceding item.
- `+`: match one or more of the preceding item.
- `*`: match zero or more of the preceding item.
- `{x}`: match exactly `x` of the preceding item.
    - `{x,y}`: match between `x` and `y` of the preceding item.
    - `{x,}`: match at least `x` of the preceding item.
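
The list above can be tried out on a toy vector. This is a minimal sketch with made-up spellings, where each quantifier applies to the letter `u`:

```r
library(stringr)

x <- c("color", "colour", "colouur")

q_optional <- str_detect(x, "colou?r")      # TRUE  TRUE  FALSE (zero or one "u")
q_plus     <- str_detect(x, "colou+r")      # FALSE TRUE  TRUE  (one or more "u")
q_star     <- str_detect(x, "colou*r")      # TRUE  TRUE  TRUE  (zero or more "u")
q_exact    <- str_detect(x, "colou{2}r")    # FALSE FALSE TRUE  (exactly two "u")
q_range    <- str_detect(x, "colou{1,2}r")  # FALSE TRUE  TRUE  (one or two "u")
```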

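With quantifiers we can pick out words by length, for example whole words of 10 to 22 characters, and count the number of words using the pattern shown above:

In [42]:
# extract whole words that are 10 to 22 word characters long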

str_extract_all(ch1, '\\b\\w{10,22}\\b')

In [35]:

str_count(ch1, '\\w+')

## 🤔 Quiz

How many words in `ch1` match the pattern:

    <word boundary><lowercase 'h'><exactly 5 word characters><word boundary>

(Example: `hardly` matches but `harry` does not.)

<ol style="list-style-type: upper-alpha;">
    <li>100</li>
    <li>19</li>
    <li>98</li>
    <li>51</li>
</ol>


In [43]:
# lowercase 'h' followed by exactly five word characters, as a whole word
str_count(ch1, '\\bh\\w{5}\\b')

Suppose we want to find all the character names in Harry Potter. Matching all words that start with a capital letter is a good start, but it also picks up the first word of every sentence, resulting in a lot of false matches. Let's use quantifiers to restrict the search to longer words, say capitalized words that are at least six characters long.

Our pattern becomes:

    <word boundary><Capital letter><at least five lowercase letters><word boundary>

In [48]:
# all cap words that have at least six characters

str_extract_all(ch1, '\\b[A-Z][a-z]{5,}\\b') %>% table

.
   Although     Bonfire    Borrowed     Bristol     Britain      Couldn 
          1           1           1           1           1           1 
    Dedalus      Diggle      Dudley  Dumbledore      Dundee     Dursley 
          1           1           9          37           1          47 
   Dursleys    Everyone     Exactly     Experts      Famous      Flocks 
          7           1           1           1           2           1 
     Godric   Grunnings      Hagrid      Harold      Harvey      Hollow 
          1           2          14           1           1           1 
     Howard      Inside     Instead      Little      London      Muggle 
          1           1           2           1           1           3 
    Muggles  Mysterious     Nothing      People     Perhaps     Petunia 
          6           1           1           2           1           3 
    Pomfrey      Potter     Potters      Privet   Professor      Really 
          1          11          10           8  

## Grouping
"Professor" is one of the most common capitalized words in this book. Is there a character named Professor, or is it just a title? Now let us try to match one or more capitalized words in a row. We can accomplish this by creating a *group*, and then applying a quantifier to it.

To create a group, I surround a part of my regexp with parentheses:

In [None]:
str_view("this will be grouped", "[a-z]+ ?")
str_view("this will be grouped", "([a-z]+ ?)")

The parentheses do not change what the regular expression matches (they also do something else, which we will discuss shortly). But now I can apply a quantifier to the whole group:

In [None]:
str_view("this will be grouped", "([a-z]+ ?)+")
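
To see what the group changes, compare a quantifier applied to a single character with the same quantifier applied to a group. This is a sketch on a made-up string:

```r
library(stringr)

s <- "ababab abbb"

# Without a group, + applies only to the "b"
no_group <- str_extract_all(s, "ab+")[[1]]
no_group    # "ab" "ab" "ab" "abbb"

# With a group, + applies to the whole "ab" unit
with_group <- str_extract_all(s, "(ab)+")[[1]]
with_group  # "ababab" "ab"
```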

So now we take the previous pattern, drop the word boundaries, allow an optional trailing space, and group it:

    (<Capital letter><at least five lowercase letters><optional space>){match 1+ times}

In [51]:
# match one or more cap words in a row
str_extract_all(ch1, '([A-Z][a-z]{5,} ?)+') %>% table

.
           Although              Bonfire             Borrowed  
                   1                    1                    1 
             Bristol              Britain               Couldn 
                   1                    1                    1 
      Dedalus Diggle               Dudley              Dudley  
                   1                    2                    7 
          Dumbledore          Dumbledore               Dundee  
                  16                   19                    1 
             Dursley             Dursley           Dursleyish  
                   9                   38                    1 
            Dursleys            Dursleys             Everyone  
                   2                    5                    1 
             Exactly             Experts               Famous  
                   1                    1                    2 
             Flocks                Godric             Gonagall 
                   1                  

In [54]:
str_locate(ch1, 'Bonfire')

str_sub(ch1, 8957, 8957 + 20)

start,end
8957,8963


## Negations
Earlier we looked at quotations. The first quotation in chapter 1 is:

In [None]:
str_sub(ch1, 2150, 2163)

How can we find other quotes? The pattern for a quote is a quotation mark, followed by any number of things that are not a quotation mark, followed by another quotation mark:

    <quotation mark><anything that is not a quotation mark><quotation mark>


To match this, we will use a *negation*. A negation is a character class that begins with the character `^`. It matches anything that is *not* inside the character class:

In [None]:
str_view("match doesn't match", "[^aeiou]+")

[1] │ <m>a<tch d>oe<sn't m>a<tch>

To match a quotation, we'll input the pattern that we specified above:

In [56]:
str_view('"Here is a quotation", said the professor. "And here is another."',
         '"[^"]+"')

[1] │ <"Here is a quotation">, said the professor. <"And here is another.">

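A hedged sketch of why the negated class is needed here: a greedy wildcard would run from the first quotation mark to the very last one. The variable names are illustrative.

```r
library(stringr)

s <- '"Here is a quotation", said the professor. "And here is another."'

# Negated class: each match stops at the next quotation mark
quotes <- str_extract_all(s, '"[^"]+"')[[1]]
length(quotes)  # 2

# Greedy wildcard: a single match spanning from the first mark to the last
greedy <- str_extract_all(s, '".+"')[[1]]
length(greedy)  # 1
```
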
## Backreferences
Parentheses define groups that can be referred to later in the match as `\1`, `\2` etc. This is called a backreference. For example:

    (.)\1

will match the same character repeated twice in a row:
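
As a minimal sketch of exactly this pattern (the word list is made up for illustration), note that the regex `\1` is typed as `"\\1"` in an R string:

```r
library(stringr)

x <- c("apple", "banana", "pear", "coffee")

# (.)\1 : any character immediately followed by itself
doubled <- str_subset(x, "(.)\\1")
doubled  # "apple" "coffee"

# (..)\1 : any two characters immediately repeated
pairs <- str_extract(x, "(..)\\1")
pairs    # NA "anan" NA NA
```

A backreference does not have to follow its group immediately; other pattern elements can sit in between.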

In [61]:
# (.)x\1: any character, then "x", then the same character again
"eyxyel"  %>% str_view("(.)x\\1")

[1] │ e<yxy>el

## 🤔 Quiz

Which kind of word does this regular expression match?

```
(..).*\\1
```


<ol style="list-style-type: upper-alpha;">
    <li>Any word that starts and ends with the same character, e.g. `alpha`</li>
    <li>Any word that starts and ends with the same two characters, e.g. `church`</li>
    <li>Any word that ends with two characters that are found earlier in the string, e.g. `therefore`</li>
    <li>I hate regular expressions.</li>
</ol>

In [None]:
# what's it match?

## Anchors
Sometimes we want a match to occur at a particular position in the string. For example, "all words which start with b". For this we have the special anchor characters: `^` and `$`. The caret `^` matches the beginning of a string. The `$` matches the end.

In [None]:
x <- c('apple', 'banana', 'pear')
str_view(x, '^b')
str_view(x, 'r$')
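
The same anchors can be checked with `str_detect`, which returns one logical value per string. This is a sketch; the extra input `"pears"` is made up to show what using both anchors together does.

```r
library(stringr)

x <- c("apple", "banana", "pear")

starts_b <- str_detect(x, "^b")      # FALSE  TRUE FALSE (starts with "b")
ends_r   <- str_detect(x, "r$")      # FALSE FALSE  TRUE (ends with "r")

# Using both anchors forces the pattern to cover the entire string
whole    <- str_detect(c("pear", "pears"), "^pear$")  # TRUE FALSE
```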

## 🤔 Quiz

What does this regular expression match?

```
^(\\w{2}).*\\1$
```


<ol style="list-style-type: upper-alpha;">
    <li>Any word that starts and ends with the same character, e.g. `alpha`</li>
    <li>Any word that starts and ends with the same two characters, e.g. `church`</li>
    <li>Any word that ends with a character that is found earlier in the string, e.g. `therefore`</li>
    <li>I hate regular expressions.</li>
</ol>

In [None]:
# solution


### `str_match`
`str_match(v, re)` will create a matrix out of the grouped matches in `re`.
The first column has the whole match, and additional columns are added for each capture group.
If the pattern does not match, you will get `NA`s.

In [None]:
head(str_match(words, '^(.).*(.)$'))

     [,1]       [,2] [,3]
[1,] NA         NA   NA  
[2,] "able"     "a"  "e" 
[3,] "about"    "a"  "t" 
[4,] "absolute" "a"  "e" 
[5,] "accept"   "a"  "t" 
[6,] "account"  "a"  "t" 
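
Another hedged sketch of `str_match`, this time pulling year, month and day out of date-like strings. The input vector is made up for illustration.

```r
library(stringr)

dates <- c("2024-01-15", "1999-12-31", "not a date")

# One row per input string: column 1 is the whole match,
# columns 2 to 4 hold the three groups (year, month, day)
m <- str_match(dates, "(\\d{4})-(\\d{2})-(\\d{2})")
dim(m)   # 3 4

# The third row is all NA because the pattern does not match
years <- m[, 2]
years    # "2024" "1999" NA
```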


### `str_replace`
`str_replace(v, re, rep)` will replace the first match of `re` in each element of `v` with `rep` (use `str_replace_all` to replace every match). The most basic usage is as a sort of find and replace:

In [None]:
str_replace('Give me liberty or give me death', '\\w+$', 'pizza')
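
A short sketch contrasting `str_replace` with `str_replace_all`, and showing that groups from the pattern can be reused in the replacement as `\\1`, `\\2`. The replacement words are made up for illustration.

```r
library(stringr)

s <- "Give me liberty or give me death"

# Only the first match is replaced
first_only <- str_replace(s, "me", "us")
first_only  # "Give us liberty or give me death"

# Every match is replaced
every <- str_replace_all(s, "me", "us")
every       # "Give us liberty or give us death"

# Backreferences in the replacement: swap the first two words
swapped <- str_replace(s, "^(\\w+) (\\w+)", "\\2 \\1")
swapped     # "me Give liberty or give me death"
```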