In [30]:
install.packages('tidyverse') # and restart your runtime if running on colab
library(tidyverse)
remotes::install_github("bradleyboehmke/harrypotter")
library(harrypotter)



# Lecture 13: Regular expressions

<div style="border: 1px double black; padding: 10px; margin: 10px">

**After today's lecture you will:**
* Understand basic regular expressions.
* Use regular expressions to extract data from text.
</div>

These notes correspond to Chapter 16 of your book.


## Regular expressions
Regular expressions (regex, regexps) are a programming language that allows you to describe patterns in strings. They have a steep learning curve but are very powerful for working with text data. In this class we will just focus on the basics of regexps. A good tool for learning regexps is [regex101](https://regex101.com/), which lets you interactively edit and debug your regular expressions.

> Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. 
>
> — Jamie Zawinski (famous nerd)

In these slides we will use the command `str_view` to see how regular expressions work. To get `str_view` working on Colab you need the latest version of `tidyverse`, which is what the installation cell at the top of these notes is for.

The most basic regular expression is a plain string. It will match if the other string contains it as a substring.

In [31]:
x = c("apple", "banana", "pear") %>% print
str_view(x, pattern = "an")

[1] "apple"  "banana" "pear"  


[2] │ b<an><an>a

Here `str_view` has matched our regexp (`"an"`) inside of the second string `banana` of the vector `x`.

In [None]:
fruit

In [32]:
str_view(fruit, 'berry')

 [6] │ bil<berry>
 [7] │ black<berry>
[10] │ blue<berry>
[11] │ boysen<berry>
[19] │ cloud<berry>
[21] │ cran<berry>
[29] │ elder<berry>
[32] │ goji <berry>
[33] │ goose<berry>
[38] │ huckle<berry>
[50] │ mul<berry>
[70] │ rasp<berry>
[73] │ salal <berry>
[76] │ straw<berry>

### Wildcards
Our first non-trivial regular expression will use a wildcard: `.`. Used inside of a regular expression, the period matches any single character:

In [33]:
str_view("else every eele etcetera", "e..e") 

[1] │ <else> every <eele> <etce>tera

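To extract matches instead of just viewing them, use `str_extract()`, which returns the first match, or `str_extract_all()`, which returns every match. The pattern below ends with a space, so it only matches `e..e` when a space follows: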

In [35]:
str_extract_all("else every eele etcetera", "e..e ") 
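
The difference between the two extraction functions can be sketched on the same sentence. This is a minimal illustration that assumes the `stringr` package is installed; the variable names `s`, `first_match` and `all_matches` are made up for this example.

```r
library(stringr)

s <- "else every eele etcetera"

# str_extract() returns only the first match, as a character vector
first_match <- str_extract(s, "e..e")

# str_extract_all() returns every match, as a list with one element per input string
all_matches <- str_extract_all(s, "e..e")

print(first_match)
print(all_matches[[1]])
```

Because `str_extract_all()` returns a list, `[[1]]` is needed to pull out the matches for the first (and here only) input string.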

Now let's return to another example from last lecture: finding all capitalized words.

In [36]:
remotes::install_github("bradleyboehmke/harrypotter")
library(harrypotter)
str(philosophers_stone)
ch1 <- philosophers_stone[1]




 chr [1:17] "THE BOY WHO LIVED　　Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfe"| __truncated__ ...


### Exercise
What is the first string that matches the pattern `H<any three characters>y` in Philosopher's Stone?

In [38]:
# H<3>y
str_extract_all(ch1, 'H...y')

### Character classes
Instead of matching anything using `.`, we often want to match a class of things: words, numbers, spaces, etc.
A "character class" is a special pattern that matches a collection of characters. There are three built-in character classes you should know, plus one closely related pattern:
- `\w`: match any word-like character (letters, digits and the underscore).
- `\s`: match any whitespace character.
- `\d`: match any digit.
- `\b`: match a "word boundary". Strictly speaking this matches a position rather than a character (more on this in a moment).



`\w` matches any word character:

In [47]:
str_view("this is a word 123", "\\w")

[1] │ <t><h><i><s> <i><s> <a> <w><o><r><d> <1><2><3>

Note the additional level of escaping needed here: `"\\w"` gets parsed by R into the string `\w`:

In [48]:
writeLines("\\w")

\w


The `\w` is then parsed again by the regular expression.

The string `"\w"` is not valid in R, because there is no escape code `\w`:

```
> "\w"
Error: '\w' is an unrecognized escape in character string starting ""\w"
```
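
If the doubled backslashes get hard to read, raw strings are one alternative. The sketch below assumes R 4.0.0 or later, where the `r"(...)"` syntax is available, and that `stringr` is installed; the variable names are hypothetical.

```r
library(stringr)

# A regular string needs two backslashes to produce one
escaped <- "\\w+"

# A raw string, r"(...)", keeps backslashes literally (requires R >= 4.0.0)
raw <- r"(\w+)"

# Both spell the same three-character regular expression: \w+
same_string <- identical(escaped, raw)
pattern_length <- nchar(escaped)

# Either form can be passed as a pattern
first_word <- str_extract("hello world", raw)

print(same_string)
print(first_word)
```

The rest of these notes keep the doubled-backslash form, since it works on every version of R.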

`\d` will match any digit:

In [49]:
str_view(c("number1", "two", "3hree"), "\\d")

[1] │ number<1>
[3] │ <3>hree

Similarly, `\s` will match whitespace (spaces, tabs and newlines):

In [50]:
y = c("spa ce", "hello\tworld", "multi\nline")
writeLines(y)
str_view(y, "\\s")

spa ce
hello	world
multi
line


[1] │ spa< >ce
[2] │ hello<{\t}>world
[3] │ multi<
    │ >line

You can also create your own character class using square brackets: `[abc]` will match *one of* `a`, `b`, or `c`. In other words, the 'width' of a character class is one character by default.

In [57]:
str_view(fruit, '[bne]..[nd]')  # match 'b', 'n' or 'e', then any two characters, then 'n' or 'd'

[13] │ canary m<elon>
[18] │ cl<emen>tine
[37] │ ho<neyd>ew
[44] │ l<emon>
[72] │ rock m<elon>
[78] │ tang<erin>e
[80] │ waterm<elon>

We can use character classes to match the first capital letter of a capitalized word:

In [58]:
str_view(c("These", "are", "some Capitalized words"),
         "[ABCDEFGHIJKLMNOPQRSTUVWXYZ]")

[1] │ <T>hese
[3] │ some <C>apitalized words

We do not need to go to all the trouble of typing each capital letter: the shortcut `[A-Z]` matches the same set. A range can cover any contiguous span of letters or digits, such as `[A-D]`, `[1-4]` or `[a-g]`:

In [60]:
str_view(c("These", "are", "some Capitalized words"), "[A-D]")

[3] │ some <C>apitalized words

In [62]:
str_view('12346789', '[1-4]')

[1] │ <1><2><3><4>6789

In [65]:
str_view(c("These", "are", "some Capitalized words", 'xyz'), "[a-g]")

[1] │ Th<e>s<e>
[2] │ <a>r<e>
[3] │ som<e> C<a>pit<a>liz<e><d> wor<d>s
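
Ranges can also be placed side by side in a pattern, or combined inside one pair of brackets. The following is a small sketch under the assumption that `stringr` is installed; the vector `codes` is invented for illustration.

```r
library(stringr)

codes <- c("A1", "b2", "C-3", "dd")

# An uppercase letter immediately followed by a digit
upper_then_digit <- str_detect(codes, "[A-Z][0-9]")

# Several ranges can share one pair of brackets: any letter or digit
alnum <- str_extract_all("C-3", "[A-Za-z0-9]")[[1]]

print(upper_then_digit)
print(alnum)
```

Only `"A1"` satisfies the first pattern: `"b2"` starts with a lowercase letter, and in `"C-3"` the capital is followed by a hyphen rather than a digit.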

### Word boundaries
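A final pattern we'll use frequently is `\b`, which stands for "word boundary". Unlike the character classes above, it does not consume a character: it matches the zero-width position at the "edges" of a word, which is why the matches below show up as empty `<>`: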

In [66]:
str_view(c("Rafael Nadal", "Roger Federer", "Novak Djokovic"), "\\b")

[1] │ <>Rafael<> <>Nadal<>
[2] │ <>Roger<> <>Federer<>
[3] │ <>Novak<> <>Djokovic<>

Every word has a word boundary on either side, so we can use this in combination with other character classes to match certain kinds of words in text.
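
To see what the boundaries buy us, compare counting a short word with and without them. This is an illustrative sketch assuming `stringr` is installed; the sentence is made up.

```r
library(stringr)

s <- "the cat concatenated two catalogs"

# Without boundaries, "cat" is also found inside longer words
n_anywhere <- str_count(s, "cat")

# With \\b on both sides, only the standalone word matches
n_whole_word <- str_count(s, "\\bcat\\b")

print(n_anywhere)
print(n_whole_word)
```

The first count is 3 (`cat`, con`cat`enated, `cat`alogs) and the second is 1.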

## 🤔 Quiz

About how many 5-letter words are there in `ch1`?

<ol style="list-style-type: upper-alpha;">
    <li>Less than 100</li>
    <li>100-299</li>
    <li>300-599</li>
    <li>600 or more</li>
</ol>


In [75]:
# 5-letter words
str_count(ch1, '\\b\\w\\w\\w\\w\\w\\b')

In this exercise, we matched the pattern 

    <word boundary><five word characters><word boundary>.

## Quantifiers
Now we can return to a question that we asked in the previous lecture: how many words are there in `ch1`? We did a crude approximation by counting the number of spaces, but we saw that this double-counted certain words. A better way is to count how many times the following pattern matches:

    <word boundary><any number of word characters><word boundary>

The four quantifiers you should know are listed below. Each one applies to the item immediately before it: a single character, a character class, or (as we will see) a group.
- `?`: match zero or one of the preceding item.
- `+`: match one or more of the preceding item.
- `*`: match zero or more of the preceding item.
- `{x}`: match exactly `x` of the preceding item.
    - `{x,y}`: match between `x` and `y` of the preceding item.
    - `{x,}`: match at least `x` of the preceding item.
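
The quantifiers listed above can be compared side by side on a toy string. This is a minimal sketch assuming `stringr` is installed; the string `s` and the variable names are invented for this example.

```r
library(stringr)

s <- "ct cat caat caaat"

# Each pattern quantifies the "a" between "c" and "t"
zero_or_one  <- str_extract_all(s, "ca?t")[[1]]      # "a" appears 0 or 1 times
one_or_more  <- str_extract_all(s, "ca+t")[[1]]      # "a" appears 1 or more times
zero_or_more <- str_extract_all(s, "ca*t")[[1]]      # "a" appears 0 or more times
exactly_two  <- str_extract_all(s, "ca{2}t")[[1]]    # "a" appears exactly twice
one_to_two   <- str_extract_all(s, "ca{1,2}t")[[1]]  # "a" appears once or twice
two_or_more  <- str_extract_all(s, "ca{2,}t")[[1]]   # "a" appears at least twice

print(zero_or_one)
print(two_or_more)
```

For instance, `ca?t` picks out `"ct"` and `"cat"`, while `ca{2,}t` picks out `"caat"` and `"caaat"`.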

So, to count the number of words using the pattern shown above:

In [77]:
# match one or more word characters between word boundaries.

str_count(ch1, '\\b\\w+\\b')

In [82]:

str_extract_all(ch1, '\\b\\w+\\b') %>% table

.
           a            A         able        about        About        above 
         105            7            2           12            2            1 
      across          act       acting     admiring       affect        after 
           2            1            1            1            1            2 
       After    afternoon        again          age        agree          air 
           2            1            3            1            1            3 
       Albus          all          All      allowed       almost         also 
           3           30            2            2            5            2 
    although     Although       always       amount        amuse           an 
           1            1            2            1            1            7 
         and          And      angrily        angry      another       answer 
          95            7            2            1            3            1 
     anxious          any       anyone     anythin

In [78]:
str_count(ch1, '\\b\\w{5}\\b')

In [80]:
str_extract_all(ch1, '\\b\\w{5}\\b')

## 🤔 Quiz

How many words in `ch1` match the pattern:

    <word boundary><capital letter><at least one lowercase letter><word boundary>

(Example: `Harry.` matches but ` I ` does not.)

<ol style="list-style-type: upper-alpha;">
    <li>Less than 100</li>
    <li>100-299</li>
    <li>300-599</li>
    <li>600 or more</li>
</ol>


In [79]:
# cap words
str_count(ch1, '\\b[A-Z][a-z]+\\b')

Let's return to an example from last lecture: finding all the character names in Harry Potter. Matching every word that starts with a capital letter is a good start, but it also picks up the first word of every sentence, which produces a lot of false matches. Let's use quantifiers to restrict the search to longer words, say capitalized words that are at least seven characters long.

Our pattern becomes:

    <Capital letter><at least six lowercase letters>

In [91]:
# a capital letter followed by at least six lowercase letters (seven or more characters in total)
# note: there are no \\b boundaries here, so this also matches inside words, e.g. "Gonagall" in "McGonagall"
str_extract_all(ch1, '[A-Z][a-z]{6,}') %>% table

.
   Although     Bonfire    Borrowed     Bristol     Britain     Dedalus 
          1           1           1           1           1           1 
 Dumbledore     Dursley  Dursleyish    Dursleys    Everyone     Exactly 
         37          47           1           7           1           1 
    Experts    Gonagall   Grunnings     Instead     Muggles  Mysterious 
          1          26           2           2           6           1 
    Nothing     Perhaps     Petunia     Pomfrey     Potters   Professor 
          1           1           3           1          10          30 
    Rejoice    Shooting     Tuesday Underground     Viewers   Voldemort 
          1           2           1           1           1           6 
  Yorkshire 
          1 

## Grouping
In the previous exercise we found that "Professor" is one of the most common capitalized words. Is there a character named Professor, or is it just a title? Now let us try to match one or more capitalized words in a row. We can accomplish this by creating a *group*, and then applying a quantifier to it. 

To create a group, I surround a part of my regexp with parentheses:

In [None]:
str_view("this will be grouped", "[a-z]+ ?")
str_view("this will be grouped", "([a-z]+ ?)")

The parentheses do not change what the regular expression matches (they do something else as well, which we will discuss shortly). But now I can apply a quantifier to the whole group:

In [None]:
str_view("this will be grouped", "([a-z]+ ?)+")
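
The effect of the parentheses is easiest to see on a very small pattern. The sketch below assumes `stringr` is installed and uses a made-up string.

```r
library(stringr)

s <- "abab abbb"

# Without a group, + applies only to the "b"
b_repeated <- str_extract_all(s, "ab+")[[1]]

# With a group, + applies to the whole "ab"
ab_repeated <- str_extract_all(s, "(ab)+")[[1]]

print(b_repeated)
print(ab_repeated)
```

The ungrouped pattern finds `"ab"`, `"ab"` and `"abbb"`, while the grouped one finds `"abab"` and `"ab"`.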

So now we take the previous pattern, add an optional trailing space, and group it:

    (<Capital letter><at least six lowercase letters><optional space>){match 1+ times}

In [93]:
# match one or more long capitalized words in a row
str_extract_all(ch1, '([A-Z][a-z]{6,} ?)+') %>% table

.
           Although              Bonfire             Borrowed  
                   1                    1                    1 
             Bristol              Britain             Dedalus  
                   1                    1                    1 
          Dumbledore          Dumbledore               Dursley 
                  16                   19                    9 
            Dursley           Dursleyish              Dursleys 
                  38                    1                    2 
           Dursleys             Everyone               Exactly 
                   5                    1                    1 
            Experts              Gonagall            Gonagall  
                   1                   10                   16 
           Grunnings           Grunnings              Instead  
                   1                    1                    2 
             Muggles             Muggles           Mysterious  
                   1                  

## Negations
Earlier we looked at quotations. The first quotation in chapter 1 is:

In [None]:
str_sub(ch1, 2150, 2163)

How can we find other quotes? The pattern for a quote is a quotation mark, followed by any number of things that are not a quotation mark, followed by another quotation mark:

    <quotation mark><anything that is not a quotation mark><quotation mark>


In [94]:
str_extract_all(ch1, '\"[^"]+\"') 

The `[^"]` in this pattern is a *negation*: a character class that begins with the character `^`. It matches any character that is *not* listed inside the brackets:

In [None]:
str_view("match doesn't match", "[^aeiou]+")  # str_view_all() is deprecated as of stringr 1.5.0

[1] │ <m>a<tch d>oe<sn't m>a<tch>
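
A class and its negation split a string into complementary pieces. This is a small sketch assuming `stringr` is installed; the string is invented for illustration.

```r
library(stringr)

s <- "room 101, floor 3"

# [0-9]+ : runs of digits
digits <- str_extract_all(s, "[0-9]+")[[1]]

# [^0-9]+ : runs of anything that is not a digit
non_digits <- str_extract_all(s, "[^0-9]+")[[1]]

print(digits)
print(non_digits)
```

The caret only has this meaning when it is the first character inside the square brackets.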

To match a quotation, we'll input the pattern that we specified above:

In [None]:
str_view('"Here is a quotation", said the professor. "And here is another."',
         '"[^"]+"')

[1] │ <"Here is a quotation">, said the professor. <"And here is another.">

## 🤔 Quiz

How many quotations are there in Ch. 1? (Use the pattern shown above.)

<ol style="list-style-type: upper-alpha;">
    <li>Less than 50</li>
    <li>50-100</li>
    <li>100-150</li>
    <li>150-200</li>
</ol>


In [95]:
# number of quotes

str_count(ch1, '\"[^"]+\"') 

## Backreferences
Parentheses define groups that can be referred to later in the match as `\1`, `\2` etc. This is called a backreference. For example:

    (.)\1

will match the same character repeated twice in a row:

In [None]:
"eel" %>% str_view("(.)\\1", match = TRUE)

[1] │ <ee>l
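
The same idea extends to longer groups. The following sketch, which assumes `stringr` is installed and uses a made-up vector `w`, checks for a doubled character and for a doubled pair of characters.

```r
library(stringr)

w <- c("eel", "banana", "letter", "pear")

# (.)\\1 : any character immediately followed by itself
has_double <- str_detect(w, "(.)\\1")

# (..)\\1 : any pair of characters immediately followed by the same pair
has_repeated_pair <- str_detect(w, "(..)\\1")

# The doubled characters themselves (NA where there is no match)
doubles <- str_extract(w, "(.)\\1")

print(has_double)
print(has_repeated_pair)
print(doubles)
```

Here `"eel"` and `"letter"` contain a doubled character (`ee`, `tt`), and only `"banana"` contains a repeated pair (`anan`).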

## 🤔 Quiz

What does this regular expression match?:

```
(..).*\\1
```


<ol style="list-style-type: upper-alpha;">
    <li>Any word that starts and ends with the same character, e.g. `alpha`</li>
    <li>Any word that starts and ends with the same two characters, e.g. `church`</li>
    <li>Any word that ends with two characters that are found earlier in the string, e.g. `therefore`</li>
    <li>I hate regular expressions.</li>
</ol>

In [99]:
# what's it match?

str_view('alpha', '(.).+\\1')

str_view('church', '(..).+\\1')

str_view('therefore', '.+(..).+\\1')

[1] │ <alpha>

[1] │ <church>

[1] │ <therefore>

## Anchors
Sometimes we want a match to occur at a particular position in the string. For example, "all words which start with b". For this we have the special anchor characters: `^` and `$`. The caret `^` matches the beginning of a string. The `$` matches the end.

In [None]:
x <- c('apple', 'banana', 'pear')
str_view(x, '^b')
str_view(x, 'r$')
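
Using both anchors together forces the pattern to account for the entire string. This is an illustrative sketch assuming `stringr` is installed; the vector `phrases` is hypothetical.

```r
library(stringr)

phrases <- c("apple", "banana", "pear", "a pear tree")

# "pear" anywhere in the string
contains_pear <- str_detect(phrases, "pear")

# The whole string must be exactly "pear"
is_pear <- str_detect(phrases, "^pear$")

# Starts with "a" and ends with "e", with anything in between
a_to_e <- str_detect(phrases, "^a.*e$")

print(contains_pear)
print(is_pear)
print(a_to_e)
```

Only the third string is exactly `"pear"`, while both `"apple"` and `"a pear tree"` start with `a` and end with `e`.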

## 🤔 Quiz

What does this regular expression match?:

```
^(.).*\\1$
```


<ol style="list-style-type: upper-alpha;">
    <li>Any word that starts and ends with the same character, e.g. `alpha`</li>
    <li>Any word that starts and ends with the same two characters, e.g. `church`</li>
    <li>Any word that ends with a character that is found earlier in the string, e.g. `therefore`</li>
    <li>I hate regular expressions.</li>
</ol>

In [None]:
# solution

In [None]:
q = 'Research is formalized curiosity. It is poking and prying with a purpose.'
# re to match capitalized words
# re = NA
# str_view(q, re)
# str_extract(q, re)

### `str_match`
`str_match(v, re)` will create a matrix out of the grouped matches in `re`. The first column has the whole match, and additional columns are added for each character group. If the pattern does not match, you will get `NA`s.

In [None]:
head(str_match(words, '^(.).*(.)$'))

     [,1]       [,2] [,3]
[1,] NA         NA   NA  
[2,] "able"     "a"  "e" 
[3,] "about"    "a"  "t" 
[4,] "absolute" "a"  "e" 
[5,] "accept"   "a"  "t" 
[6,] "account"  "a"  "t" 
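
`str_match` is handy for pulling structured pieces out of text. The sketch below assumes `stringr` is installed; the `entries` vector and its `key=value` layout are invented for this example.

```r
library(stringr)

entries <- c("name=Harry", "house=Gryffindor", "no equals sign")

# Two groups: the word characters before "=" and the word characters after it
m <- str_match(entries, "(\\w+)=(\\w+)")

keys   <- m[, 2]  # first group
values <- m[, 3]  # second group

print(dim(m))
print(keys)
print(values)
```

The third entry does not match, so its row in `m` is filled with `NA`.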


### `str_replace`
`str_replace(v, re, rep)` will replace each match of `re` in `v` with `rep`. The most basic usage is as a sort of find and replace:

In [100]:
str_replace('Give me liberty or give me death', '\\w+$', 'pizza')
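
Note that `str_replace` only changes the first match in each string; its sibling `str_replace_all` changes every match. A minimal sketch, assuming `stringr` is installed:

```r
library(stringr)

s <- "Give me liberty or give me death"

# str_replace() changes only the first match
first_only <- str_replace(s, "me", "us")

# str_replace_all() changes every match
every_match <- str_replace_all(s, "me", "us")

# The example from the cell above: replace the last word
last_word <- str_replace(s, "\\w+$", "pizza")

print(first_only)
print(every_match)
print(last_word)
```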

A very useful feature of regexp replacements is the ability to use backreferences:

In [None]:
# Your code here
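
One possible use of backreferences in a replacement is to reorder the captured pieces. The sketch below assumes `stringr` is installed; the vector `full_names` and its "Last, First" layout are made up for illustration, and this is only one of many ways to fill in the cell above.

```r
library(stringr)

full_names <- c("Potter, Harry", "Granger, Hermione")

# Group 1 captures the last name and group 2 the first name;
# \\2 and \\1 in the replacement put them back in swapped order
swapped <- str_replace(full_names, "(\\w+), (\\w+)", "\\2 \\1")

print(swapped)
```

The result is `"Harry Potter"` and `"Hermione Granger"`.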