In [None]:
install.packages('tidyverse') # if this is not done, your str_view, str_view_all functions will not work
library(tidyverse)
remotes::install_github("bradleyboehmke/harrypotter")
library(harrypotter)

# Lecture 13: Regular expressions

<div style="border: 1px double black; padding: 10px; margin: 10px">

**After today's lecture you will:**
* Understand basic regular expressions.
* Use regular expressions to extract data from text.
</div>

These notes correspond to Chapter 16 of your book.


## Regular expressions
Regular expressions (regex, regexps) are a compact language for describing patterns in strings. They have a steep learning curve but are very powerful for working with text data. In this class we will focus on the basics of regexps. A good tool for learning regexps is [regex101](https://regex101.com/), which lets you interactively edit and debug your regular expressions.

> Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. 
>
> — Jamie Zawinski (famous nerd)

The command `str_view` takes a character vector and a regular expression, and shows you how they match.

The most basic regular expression is a plain string. It will match if the other string contains it as a substring.

In [None]:
x = c("apple", "banana", "pear") %>% print
str_view(x, pattern = "an")

[1] "apple"  "banana" "pear"  


[2] │ b<an><an>a

Here `str_view` has matched our regexp (`"an"`) inside of the second string `banana` of the vector `x`.

In [None]:
fruit

In [None]:
str_view(fruit, 'berry')

 [6] │ bil<berry>
 [7] │ black<berry>
[10] │ blue<berry>
[11] │ boysen<berry>
[19] │ cloud<berry>
[21] │ cran<berry>
[29] │ elder<berry>
[32] │ goji <berry>
[33] │ goose<berry>
[38] │ huckle<berry>
[50] │ mul<berry>
[70] │ rasp<berry>
[73] │ salal <berry>
[76] │ straw<berry>

### Wildcards
Our first non-trivial regular expression will use a wildcard: `.`. Used inside of a regular expression, the period matches any single character:

In [None]:
str_view("else every eele etcetera", "e..e ") 

[1] │ <else >every <eele >etcetera

If we want to "extract" the first match we can use `str_extract()` instead:

In [None]:
str_extract("else every eele etcetera", "e..e ") 

### Exercise
What's the first word in `ch1` that ends in `ing`?

In [None]:
# extract the first word ending in ing

Suppose I want to answer the question: what are the characters named in Harry Potter? At a first pass, we might guess that a character name is one (or more) capitalized words. How can I match capitalized words?

A capitalized word matches the following *pattern*:

    capitalized word = 
    "<upper case letter><one or more other letters><space>"

A regular expression lets us match this type of pattern using something called a *character class*.

### Character classes
A "character class" is a special pattern that matches a collection of characters. For example, `\d` will match any digit:

In [None]:
str_view(c("number1", "two", "3hree"), "\\d")

Similarly, `\s` will match whitespace (spaces, tabs and newlines):

In [None]:
y = c("spa ce", "hello\tworld", "multi\nline")
writeLines(y)
str_view(y, "\\s")

spa ce
hello	world
multi
line


You can form your own character class using square brackets: `[abc]` will match *one of* `a`, `b`, or `c`. In other words, the 'width' of a character class is one character by default.

In [None]:
str_view(x, '[be]a')  # Match either 'b' or 'e' followed by a

We can use character classes to match the first capital letter of a capitalized word:

In [None]:
str_view(c("These", "are", "some Capitalized words"),
         "[ABCDEFGHIJKLMNOPQRSTUVWXYZ]")

We do not need to go to all the trouble of typing each capital letter. We can use the shortcut `[A-Z]` instead.

In [None]:
str_view(c("These", "are", "some Capitalized words"), "[A-Z]")

In [None]:
str_view_all(str_sub(ch1, 1, 50), "[A-Z]")

So far we are matching just single characters. Now I want to expand the match to include the whole word. To do so I will use the special character class called `\w`, which matches a "word character":

In [None]:
str_view_all("Here is a sentence", "\\w")

Note the additional level of escaping needed here: "\\w" gets parsed by R into the string `\w`:

In [None]:
writeLines("\\w")

The `\w` is then parsed again by the regular expression.

The string "\w" is not valid in R, because there is no escape code "\w":

```
> "\w"
Error: '\w' is an unrecognized escape in character string starting ""\w"
```
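
The two layers of escaping can be checked directly. The sketch below is only an illustration: it assumes R 4.0 or later for the raw-string syntax `r"(...)"`, which these notes do not otherwise use. It shows that `"\\w"` is a two-character string and that a raw string spells the same pattern without doubling the backslash.

```r
library(stringr)

# "\\w" is how the two-character string \w is typed in R source code
pattern <- "\\w"
nchar(pattern)                    # 2: a backslash and the letter w

# Raw strings (R >= 4.0) skip R's escape processing, so one backslash suffices
raw_pattern <- r"(\w)"
identical(pattern, raw_pattern)   # TRUE

# Both spellings hand the same regexp to stringr
str_count("Here is a sentence", pattern)      # 15 word characters
str_count("Here is a sentence", raw_pattern)  # 15
```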

Using `\w`, we can expand our regexp to match capitalized words containing two letters:

In [None]:
str_view_all(str_sub(ch1, 1, 50), "[A-Z]\\w")

Or three, or four:

In [None]:
str_view_all(str_sub(ch1, 1, 50), "[A-Z]\\w\\w")
str_view_all(str_sub(ch1, 1, 50), "[A-Z]\\w\\w\\w")

Now we have run into an issue: by adding more `\w`s, we have excluded all the capitalized words that are shorter than the pattern. But we want to match capitalized words of any length. To do this, we will introduce a *quantifier*. The `*` character says, "match zero or more of the thing that came immediately before me":

In [None]:
str_view_all(str_sub(ch1, 1, 50), "[A-Z]\\w*")
str_extract_all(str_sub(ch1, 1, 50), "[A-Z]\\w*")

A closely related quantifier is `+`, which matches *one* or more of the preceding thing:

In [None]:
str_extract_all(ch1, "[A-Z]\\w\\w\\w+")[[1]] %>% unique
# same matches as "[A-Z]\\w\\w\\w\\w*"; note the difference with:
# str_view(str_sub(ch1, 1, 50), "[A-Z]\\w\\w\\w*")
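
To see the difference between `*` and `+` on something small, here is a minimal sketch using a made-up sentence (the string `s` is hypothetical, not taken from the book). A lone capital letter matches with `*` but not with `+`, because `+` insists on at least one word character after the capital.

```r
library(stringr)

s <- "I saw A Boy and Al"

# `*` allows zero word characters after the capital, so lone capitals match
star <- str_extract_all(s, "[A-Z]\\w*")[[1]]

# `+` demands at least one word character after the capital
plus <- str_extract_all(s, "[A-Z]\\w+")[[1]]

star  # "I" "A" "Boy" "Al"
plus  # "Boy" "Al"
```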

The most general form of quantifier is `{<min>,<max>}`:

In [None]:
str_extract_all(ch1, "[A-Z]\\w{5,}")[[1]] %>% unique
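
The braces come in a few variants: `{n}` (exactly n), `{n,}` (n or more) and `{n,m}` (between n and m). The following sketch tries each on a made-up word; the strings are illustrative only. Quantifiers are greedy, so each one takes as many characters as its upper limit allows.

```r
library(stringr)

x <- "Abcdefg"

exact   <- str_extract(x, "[A-Z]\\w{2}")     # capital + exactly 2 more
bounded <- str_extract(x, "[A-Z]\\w{2,4}")   # capital + 2 to 4 more (takes 4)
open    <- str_extract(x, "[A-Z]\\w{2,}")    # capital + 2 or more (takes all)
short   <- str_extract("Ab", "[A-Z]\\w{2,}") # too short, so no match

exact    # "Abc"
bounded  # "Abcde"
open     # "Abcdefg"
short    # NA
```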

### Exercise
Find all capitalized words in chapter 1 that have at least 6 characters.

In [None]:
# all cap words that have at least six characters

### Word boundaries
The pattern we used above for finding capitalized words is not quite accurate, since it assumes that a space comes after each word. Also, it matches capital letters occurring in the middle of a word.

In [None]:
str_view_all(c("Dick vanDyke was a TV host", "Roger Federer"),
             "[A-Z]\\w* ")

A better pattern would be something like:

    capitalized word = 
        <word boundary><upper case letter><zero or more other letters><word boundary>
    
Regular expressions give us the ability to do this using the special character `\b`. This matches the boundary of a word:

In [None]:
str_view_all("Here is a sentence", "\\b")
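
As a rough sketch of how `\b` repairs the capitalized-word pattern, the code below runs the space-based pattern and a boundary-based one on the same two strings used earlier. The variable names are arbitrary.

```r
library(stringr)

s <- c("Dick vanDyke was a TV host", "Roger Federer")

# Space-based pattern: needs a trailing space and matches mid-word capitals
with_space <- str_extract_all(s, "[A-Z]\\w* ")

# Boundary-based pattern: the capital must start a word, and the match
# ends wherever the word ends, with or without a following space
with_boundary <- str_extract_all(s, "\\b[A-Z]\\w*\\b")

with_space[[1]]     # "Dick " "Dyke " "TV "
with_space[[2]]     # "Roger "
with_boundary[[1]]  # "Dick" "TV"
with_boundary[[2]]  # "Roger" "Federer"
```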

Let's build a regex that counts the words in a sentence. First, we need a pattern for what a word looks like:

    <word boundary><one or more letters><word boundary>

### Exercise
Translate this pattern into a regexp. Test it out on your favorite sentence:


In [None]:
str_count(ch1, "\\b\\w+\\b")

The `str_count()` function counts the number of matches:

In [None]:
str_view_all("Here is a sentence", "sentence")
str_count("Here is a sentence", "sentence")
str_count(ch1, ".")
str_length(ch1)

### Exercise
**Beginner** How many words are in Chapter 1?

**Advanced** How many words ending in `ing` are in Chapter 1?

## Grouping
In the previous exercise we found that "Professor" is one of the most common capitalized words. Is there a character named Professor, or is it just a title? Now let us try to match one or more capitalized words in a row. We can accomplish this by creating a *group*, and then applying a quantifier to it. 

To create a group, I surround a part of my regexp with parentheses:

In [None]:
str_view("this will be grouped", "[a-z]+ ?")
str_view("this will be grouped", "([a-z]+ ?)")

The parentheses do not change what the regular expression matches (but they are doing something else, which we will discuss in the next lecture). But now I can apply a quantifier to the whole group:

In [None]:
str_view("this will be grouped", "([a-z]+ ?)+")
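
The same idea applied to capitalized words, as a small sketch on a made-up sentence (the string `s` is hypothetical). Without the quantified group each capitalized word is its own match; with it, consecutive capitalized words are glued into a single match.

```r
library(stringr)

s <- "Professor McGonagall said hello to Harry"

# One capitalized word, optionally followed by a space
single <- str_extract_all(s, "[A-Z]\\w+ ?")[[1]]

# Quantifying the group joins runs of consecutive capitalized words
runs <- str_extract_all(s, "([A-Z]\\w+ ?)+")[[1]]

single  # "Professor " "McGonagall " "Harry"
runs    # "Professor McGonagall " "Harry"
```

Note the trailing space kept inside the first run; `str_trim()` would remove it if needed.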

(Aside: This is sort of the "Ah-ha!" moment when it comes to learning regular expressions. Once you understand that you can do things like this, you begin to unlock their power.)

In [None]:
str_extract_all(ch1, "([A-Z]\\w{4,} )+", 
                simplify=T) %>% fct_count %>% 
                top_n(10)

Earlier we looked at quotations. The first quotation in chapter 1 is:

In [None]:
str_sub(ch1, 2150, 2163)

How can we find other quotes? The pattern for a quote is a quotation mark, followed by any number of things that are not a quotation mark, followed by another quotation mark:

    <quotation mark><anything that is not a quotation mark><quotation mark>


To match this, we will use a *negation*. A negation is a character class that begins with the character "^". It matches anything that is *not* inside the character class:

In [None]:
str_view_all("match doesn't match", "[^aeiou]+")

To match a quotation, we'll input the pattern that we specified above:

In [None]:
str_view_all('"Here is a quotation", said the professor. "And here is another."',
             '"[^"]+"')
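
To see why the negation matters, the sketch below compares it with the more obvious-looking pattern `".+"` on the same example string. Because quantifiers are greedy, `.+` runs all the way to the last quotation mark and swallows both quotations in one match.

```r
library(stringr)

s <- '"Here is a quotation", said the professor. "And here is another."'

# Greedy wildcard: one match stretching from the first quote to the last
greedy <- str_extract_all(s, '".+"')[[1]]

# Negated class: each match stops at the first closing quotation mark
negated <- str_extract_all(s, '"[^"]+"')[[1]]

length(greedy)   # 1
negated          # "\"Here is a quotation\"" "\"And here is another.\""
```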

### Exercise
**Beginner** How many quotations are there in ch1?

**Advanced** What is the longest quotation in the whole book?

In [None]:
# Your code here

## Backreferences
Parentheses define groups that can be referred to later in the match as `\1`, `\2` etc. This is called a backreference. For example:

    (.)\1

will match the same character repeated twice in a row:

In [None]:
"eel"  %>% str_view("(.)\\1", match = T)

[1] │ <ee>l
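
A short sketch of the same pattern on a few more words (chosen only for illustration) shows that `\1` must repeat exactly what the group captured, so words without a doubled character do not match at all.

```r
library(stringr)

w <- c("eel", "banana", "pear", "llama")

# (.) captures any one character; \\1 must repeat that same character
has_double <- str_detect(w, "(.)\\1")
doubled    <- str_extract(w, "(.)\\1")

has_double  # TRUE FALSE FALSE TRUE
doubled     # "ee" NA NA "ll"
```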

### Exercise
What does this regular expression match?

```
"(..).*\\1"
```





In [None]:
# Your code here

### Exercise

**Beginner** Write a regexp that matches words ending in the same vowel repeated twice. (For example, "levee".)

**Advanced** Write a regexp that matches two *or more* repeated characters. It should work as follows:

```
> str_extract(c("breeeeze", "hahahaaaaaaaaa"), 
              my_regexp)
[1] 'eeee' 'aaaaaaaaa'
```

## Alternatives
An *alternative* means *match this or that*. Alternative patterns can be matched using the syntax `(this|that)`.

In [None]:
color_re = "colo(r|ur)"
x <- c("color", "red colour", "coloured glass", "chair", 
       "colored chair")
str_view(x, color_re)

### Example
Suppose we want to match telephone numbers of the form:

* xxx-xxx-xxxx
* (xxx) xxx-xxxx

In [None]:
# complicated because of all the double backslashes
phone_re = "(\\d\\d\\d-|\\(\\d\\d\\d\\) )\\d\\d\\d-\\d\\d\\d\\d" 
writeLines(phone_re)

(\d\d\d-|\(\d\d\d\) )\d\d\d-\d\d\d\d


In [None]:
n <- c("123-456-7890", "(123) 456-7890", "1234567890", "+1-123-456-7890")
str_view(n, phone_re)
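
The repeated `\\d`s can be shortened with the `{n}` quantifier from earlier. The sketch below is one possible rewrite (the name `phone_re2` is made up here) and checks that it extracts the same thing as `phone_re` on the test numbers.

```r
library(stringr)

# Original pattern and a shorter equivalent using {n} quantifiers
phone_re  <- "(\\d\\d\\d-|\\(\\d\\d\\d\\) )\\d\\d\\d-\\d\\d\\d\\d"
phone_re2 <- "(\\d{3}-|\\(\\d{3}\\) )\\d{3}-\\d{4}"

n <- c("123-456-7890", "(123) 456-7890", "1234567890", "+1-123-456-7890")

long_form  <- str_extract(n, phone_re)
short_form <- str_extract(n, phone_re2)

short_form
# "123-456-7890" "(123) 456-7890" NA "123-456-7890"
identical(long_form, short_form)  # TRUE
```

The last number still matches because the pattern is found inside the longer string; rejecting it would require pinning the match to the start and end of the string.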

## Anchors
Sometimes we want a match to occur at a particular position in the string. For example, "all words which start with b". For this we have the special anchor characters: `^` and `$`. The caret `^` matches the beginning of a string. The `$` matches the end.

In [None]:
x <- c('apple', 'banana', 'pear')
str_view(x, '^b')
str_view(x, 'r$')
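
Using both anchors at once forces the pattern to account for the entire string. A minimal sketch, with one extra fruit added to the vector for illustration:

```r
library(stringr)

x <- c("apple", "banana", "pear", "pineapple")

anywhere <- str_detect(x, "apple")    # "apple" may appear anywhere
whole    <- str_detect(x, "^apple$")  # the string must be exactly "apple"

anywhere             # TRUE FALSE FALSE TRUE
whole                # TRUE FALSE FALSE FALSE
str_subset(x, "^p")  # "pear" "pineapple"
str_subset(x, "e$")  # "apple" "pineapple"
```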

### Exercise
What does this regexp do?
```
re = "^(.).*\\1$" 
```

### `str_extract`
`str_extract(v, re)` extracts the first substring matched by `re` from each element of `v`. Another way to think of this is as returning the portion of the string which is highlighted by `str_view`:

In [None]:
q = 'Research is formalized curiosity. It is poking and prying with a purpose.'
# re to match capitalized words
# re = NA
# str_view(q, re)
# str_extract(q, re)

Analogous to `str_view_all` we have `str_extract_all`:

In [None]:
str_view_all(q, re)
str_extract_all(q, re)

### `str_match`
`str_match(v, re)` will create a matrix out of the grouped matches in `re`. The first column has the whole match, and additional columns are added for each character group. If the pattern does not match, you will get `NA`s.

In [None]:
head(str_match(words, '^(.).*(.)$'))

     [,1]       [,2] [,3]
[1,] NA         NA   NA  
[2,] "able"     "a"  "e" 
[3,] "about"    "a"  "t" 
[4,] "absolute" "a"  "e" 
[5,] "accept"   "a"  "t" 
[6,] "account"  "a"  "t" 

### `str_replace`
`str_replace(v, re, rep)` will replace the first match of `re` in each element of `v` with `rep`; `str_replace_all` replaces every match. The most basic usage is as a sort of find and replace:

In [None]:
str_replace('Give me liberty or give me death', '\\w+$', 'pizza')

A very useful feature of regexp replacements is the ability to use backreferences:

In [None]:
# Your code here
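
One possible illustration, offered as a sketch rather than the intended solution: inside the replacement string, `\\1` and `\\2` stand for whatever the first and second groups captured, so the pieces of a match can be rearranged. The example strings are made up.

```r
library(stringr)

s <- c("Potter, Harry", "Granger, Hermione")

# \\1 is the surname and \\2 the first name; swap them and drop the comma
swapped <- str_replace(s, "(\\w+), (\\w+)", "\\2 \\1")
swapped  # "Harry Potter" "Hermione Granger"

# str_replace changes only the first match; str_replace_all changes them all
first_only <- str_replace("aa bb cc", "(\\w)\\1", "<\\1>")
every      <- str_replace_all("aa bb cc", "(\\w)\\1", "<\\1>")

first_only  # "<a> bb cc"
every       # "<a> <b> <c>"
```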