# Lecture 8.2: String Functions
<div style="border: 1px double black; padding: 10px; margin: 10px">

**After today's lecture you will:**
* Have a deep understanding of Harry Potter and string functions
</div>

This lecture corresponds to Chapter 14 of your textbook.

In [1]:
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.5     ✔ purrr   0.3.4
✔ tibble  3.1.4     ✔ dplyr   1.0.7
✔ tidyr   1.1.3     ✔ stringr 1.4.0
✔ readr   2.0.1     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



## Harry Potter
Today's lecture will be all about Harry Potter.
![harry potter](https://images-na.ssl-images-amazon.com/images/I/51HSkTKlauL._SX346_BO1,204,203,200_.jpg)

In [2]:
#install.packages("devtools")
#devtools::install_github("bradleyboehmke/harrypotter",force=TRUE)
library(harrypotter)
str(philosophers_stone)
ch1 <- philosophers_stone[[1]]

 chr [1:17] "THE BOY WHO LIVED　　Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfe"| __truncated__ ...


## String functions in R

Base R has built-in commands for dealing with strings, but as with the base R data manipulation commands, they have an inconsistent interface and are hard to remember. Instead we will focus on functions provided by the `stringr` package, which is part of `tidyverse`. They all start with `str_`.

### Combining strings
Combining two strings into one is called "concatenation" by computer scientists and "combining strings" by everyone else. `concatenate` is hard to type, so it is abbreviated `str_c`:

In [4]:
c("a", "b", "c") %>% print
str_c("a", "b", "c") 

[1] "a" "b" "c"


Like most other commands, `str_c` is vectorized, meaning it will take vector arguments and recycle the shorter ones to the length of the longest:

In [5]:
mystrings <- c("one", "two", "ten")
str_c("*** ", mystrings , " ***") 
# each argument is expanded to the length of the longest

As usual, `NA` values propagate:

In [6]:
mystrings_na <- c("one", "two", NA)
str_c("*** ", mystrings_na, " ***") # missingness is contagious!
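If a literal `"NA"` is preferable to a missing result, one option is stringr's `str_replace_na()`, which converts `NA` to the string `"NA"` before concatenation. A minimal sketch (the names `joined_na` and `joined_filled` are only for illustration):

```r
library(stringr)

mystrings_na <- c("one", "two", NA)

# NA in any argument makes the corresponding result NA
joined_na <- str_c("*** ", mystrings_na, " ***")

# str_replace_na() turns NA into the literal string "NA" first
joined_filled <- str_c("*** ", str_replace_na(mystrings_na), " ***")

print(joined_na)
print(joined_filled)
```

Whether a visible `"NA"` or a missing value is the better outcome depends on what the combined strings are used for afterwards.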

Another use of `str_c` is to combine multiple strings into one:

In [7]:
str_c("one", "two", "ten", sep = ", ") # can provide a separator

If you already know some R, you might recognize this as being equivalent to 
```{r} 
paste("one", "two", "ten", sep=", ")
```

Be mindful of the difference between passing in a vector of strings as a single argument, and passing in multiple strings as separate arguments:

In [8]:
str_c(mystrings, sep = ", ") # why does this not combine the strings?
str_c(mystrings, collapse = ", ") # use collapse if the strings you want to combine are in a vector
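The difference between `sep` and `collapse` is easiest to see from the length of the result. The sketch below reuses `mystrings`; the other names and the small `c("a", "b")` vectors are made up for illustration:

```r
library(stringr)

mystrings <- c("one", "two", "ten")

# sep goes between *arguments*; with a single argument there is nothing to separate
with_sep <- str_c(mystrings, sep = ", ")

# collapse joins the *elements* of the result into a single string
with_collapse <- str_c(mystrings, collapse = ", ")

# both together: sep works element-wise across arguments, then collapse joins the results
both <- str_c(c("a", "b"), c("x", "y"), sep = "-", collapse = ", ")

length(with_sep)       # 3
length(with_collapse)  # 1
print(both)
```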

## Regular expressions

Regular expressions (regex, regexps) are a programming language that allows you to describe patterns in strings. They have a steep learning curve but are very powerful for working with text data. In this class we will just focus on the basics of regexps. A good tool for learning regexps is [regex101](https://regex101.com/), which lets you interactively edit and debug your regular expressions.

The commands `str_view` and `str_view_all` take a character vector and a regular expression, and show you how they match. 

The most basic regular expression is a plain string. It will match if the other string contains it as a substring.

In [9]:
options(jupyter.rich_display=T) # needed for str_view to work in jupyter notebook

In [12]:
#install.packages("htmlwidgets")
x = c("apple", "banana", "pear") %>% print
str_view(x, pattern = "an")
str_view(ch1, "Harry")

[1] "apple"  "banana" "pear"  


Here `str_view` has matched our regexp (`"an"`) inside of the second string `banana` of the vector `x`.

You might wonder why `str_view` highlighted only the first match, even though `banana` contains two instances of the pattern `an`. This is its default behavior. To show all the matches, use `str_view_all`:

In [13]:
str_view_all(x, pattern = "an")
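`str_view` renders an HTML widget, so its output is only visible in a notebook or viewer. The same first-match versus all-matches distinction can be checked with functions that return ordinary R values. This is a sketch, with variable names chosen for illustration. (In stringr 1.5.0 and later, `str_view()` highlights every match and `str_view_all()` is deprecated, so the widget output may differ from what is described here.)

```r
library(stringr)

x <- c("apple", "banana", "pear")

first_match <- str_extract(x, "an")      # first match per string, NA if none
all_matches <- str_extract_all(x, "an")  # list holding every match per string
n_matches   <- str_count(x, "an")        # number of matches per string

print(first_match)
print(all_matches[[2]])
print(n_matches)
```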



### Wildcards
Our first non-trivial regular expression will use a wildcard: `.`. Used inside of a regular expression, the period matches any single character:

In [14]:
str_view("else every eele etcetera", "e..e ") 
str_view_all("else every eele etcetera", "e..e ") 
str_extract_all("else every eele etcetera", "e..e ") 
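Because `.` is special inside a regular expression, matching a literal period means escaping it with a backslash, which is written `"\\."` inside an R string. The sketch below shows both uses; the `files` vector is made up for illustration:

```r
library(stringr)

words <- "else every eele etcetera"

# "e..e " = an e, any two characters, an e, then a space
four_letter <- str_extract_all(words, "e..e ")[[1]]
print(four_letter)

# An unescaped "." matches any character; "\\." matches only a literal period
files <- c("a.c", "abc")
any_char    <- str_detect(files, "a.c")
literal_dot <- str_detect(files, "a\\.c")
print(any_char)
print(literal_dot)
```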

### Exercise
**Beginner** What's the first word in ch1 that ends in `ing`?

**Advanced** What 4-letter words in ch1 begin and end with the letter `e`?

In [60]:
ch1 %>% str_extract_all("......ing")  # beginner: any six characters followed by "ing"
ch1 %>% str_extract_all(" e..e ")     # advanced: space, e, any two characters, e, space

Suppose I want to answer the question: what are the characters named in Harry Potter? At a first pass, we might guess that a character name is one (or more) capitalized words. How can I match capitalized words?

A capitalized word matches the following *pattern*:

    capitalized word = 
    "<upper case letter><one or more other letters><space>"

A regular expression lets us match this type of pattern using something called a *character class*.

### Character classes
A "character class" is a special pattern that matches a collection of characters. For example, `\d` will match any digit:

In [62]:
str_view(c("number1", "two", "3hree"), "\\d")

Similarly, `\s` will match whitespace (spaces, tabs and newlines):

In [63]:
y = c("spa ce", "hello\tworld", "multi\nline")
writeLines(y)
str_view(y, "\\s")

spa ce
hello	world
multi
line


You can form your own character class using square brackets: `[abc]` will match *one of* `a`, `b`, or `c`. In other words, the 'width' of a character class is one character by default.

In [64]:
str_view(x, '[be]a')  # Match either 'b' or 'e' followed by a
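The character classes above can also be checked without the viewer. This sketch uses `str_extract` and `str_detect`; the `"nospace"` string and the variable names are added for illustration:

```r
library(stringr)

x <- c("apple", "banana", "pear")

# first digit in each string, NA where there is none
digits <- str_extract(c("number1", "two", "3hree"), "\\d")

# does the string contain any whitespace (space, tab, newline)?
has_ws <- str_detect(c("spa ce", "hello\tworld", "nospace"), "\\s")

# does the string contain "ba" or "ea"?
has_bea <- str_detect(x, "[be]a")

print(digits)
print(has_ws)
print(has_bea)
```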

We can use character classes to match the first capital letter of a capitalized word:

In [65]:
str_view(c("These", "are", "some Capitalized words"),
         "[ABCDEFGHIJKLMNOPQRSTUVWXYZ]")

We do not need to go to all the trouble of typing each capital letter. We can use the shortcut `[A-Z]` instead.

In [66]:
str_view(c("These", "are", "some Capitalized words"), "[A-Z]")

In [67]:
str_view_all(str_sub(ch1, 1, 50), "[A-Z]")

So far we are matching just single characters. Now I want to expand the match to include the whole word. To do so I will use the special character class called `\w`, which matches a "word character":

In [68]:
str_view_all("Here is a sentence", "\\w")

Note the additional level of escaping needed here: `"\\w"` gets parsed by R into the string `\w`:

In [69]:
writeLines("\\w")

\w


The `\w` is then parsed again by the regular expression.

The string `"\w"` is not valid in R, because there is no escape code `\w`:

```
> "\w"
Error: '\w' is an unrecognized escape in character string starting ""\w"
```
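One way to see the two levels of parsing is to count characters. The sketch below also uses a raw string, which assumes R 4.0.0 or later; raw strings let you write the backslash once:

```r
library(stringr)

# Two characters reach the regex engine: a backslash and a "w"
pattern <- "\\w"
nchar(pattern)

# Raw string syntax (R >= 4.0.0): no doubling of the backslash is needed
raw_pattern <- r"(\w)"
identical(pattern, raw_pattern)

# Both spellings describe the same regular expression
word_chars <- str_extract_all("Hi there", raw_pattern)[[1]]
print(word_chars)
```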

Using `\w`, we can expand our regexp to match a capital letter together with the word character that follows it:

In [70]:
str_view_all(str_sub(ch1, 1, 50), "[A-Z]\\w")

In [71]:
str_view_all(str_sub(ch1, 1, 50), "[A-Z]\\w\\w")
str_view_all(str_sub(ch1, 1, 50), "[A-Z]\\w\\w\\w")

Now we have run into an issue: by adding more `\w`s, we have excluded every capitalized word that is too short for the pattern (with three `\w`s, anything of three letters or fewer). But we want to match capitalized words of any length. To do this, we will introduce a *quantifier*. The `*` character says, "match zero or more of the thing that came immediately before me":

In [72]:
str_view_all(str_sub(ch1, 1, 50), "[A-Z]\\w*")
str_extract_all(str_sub(ch1, 1, 50), "[A-Z]\\w*")

In [75]:
str_extract_all(ch1, "[A-Z]\\w\\w\\w\\w*")[[1]] %>% unique
# note the difference with:
str_view_all(str_sub(ch1, 1, 50), "[A-Z]\\w\\w\\w*")

The most general form of quantifier is `{<min>,<max>}`:

In [80]:
str_extract_all(ch1, "[A-Z]\\w{3,}")[[1]] %>% unique
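To compare the quantifier forms side by side, here is a sketch on a short made-up sentence rather than the chapter text. Leaving out `<max>` means "no upper limit", and `{n}` means "exactly n":

```r
library(stringr)

s <- "Mr and Mrs Dursley of Privet Drive"

star       <- str_extract_all(s, "[A-Z]\\w*")[[1]]      # capital + zero or more
at_least_3 <- str_extract_all(s, "[A-Z]\\w{3,}")[[1]]   # capital + three or more
one_or_two <- str_extract_all(s, "[A-Z]\\w{1,2}")[[1]]  # capital + one or two
exactly_2  <- str_extract_all(s, "[A-Z]\\w{2}")[[1]]    # capital + exactly two

print(star)
print(at_least_3)
print(one_or_two)
print(exactly_2)
```

Note that `{1,2}` and `{2}` stop in the middle of longer words such as `Dursley`, because nothing in the pattern requires the word to end there.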

### Exercise
Find all capitalized words in chapter 1 that have at least 6 characters.

In [81]:
str_extract_all(ch1, "[A-Z]\\w{5,}") %>% 
    unlist %>% table

.
   Although     Bonfire    Borrowed     Bristol     Britain      Couldn 
          1           1           1           1           1           1 
    Dedalus      Diggle      Dudley  Dumbledore      Dundee     Dursley 
          1           1           9          37           1          47 
 Dursleyish    Dursleys    Everyone     Exactly     Experts      Famous 
          1           7           1           1           1           2 
     Flocks      Godric   Grunnings      Hagrid      Harold      Harvey 
          1           1           2          14           1           1 
     Hollow      Howard      Inside     Instead      Little      London 
          1           1           1           2           1           1 
 McGonagall    McGuffin      Muggle     Muggles  Mysterious     Nothing 
         26           1           3           6           1           1 
     People     Perhaps     Petunia     Pomfrey      Potter     Potters 
          2           1           3           1  

### Word boundaries
The pattern we used above for finding capitalized words is not quite accurate, since it assumes that a space comes after each word. Also, it matches capital letters occurring in the middle of a word.

In [82]:
str_view_all(c("Dick vanDyke was a TV host", "Roger Federer"),
             "[A-Z]\\w* ")

A better pattern would be something like:

    capitalized word = 
        <word boundary><upper case letter><zero or more other letters><word boundary>
    
Regular expressions give us the ability to do this using the special character `\b`. This matches the boundary of a word:

In [83]:
str_view_all("Here is a sentence", "\\b")
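Applied to the two example sentences from above, the boundary-based pattern can be sketched as follows (the variable names are illustrative). It drops the mid-word match in `vanDyke` and picks up the last word of a string, which has no trailing space:

```r
library(stringr)

sentences <- c("Dick vanDyke was a TV host", "Roger Federer")

# Old pattern: requires a trailing space, allows a mid-word start
no_boundary <- str_extract_all(sentences, "[A-Z]\\w* ")

# New pattern: the match must start and end at a word boundary
with_boundary <- str_extract_all(sentences, "\\b[A-Z]\\w*\\b")

print(no_boundary)
print(with_boundary)
```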

Let's try to count the words in a sentence. First, we need a pattern for what a word looks like:

    <word boundary><one or more letters><word boundary>
    
To express "one or more letters" we use `+`, a quantifier closely related to `*`: it matches *one* or more of the preceding thing. The `str_count()` function then counts the number of matches:

In [92]:
str_count("Here is a sentence", "\\b\\w+\\b")

In [93]:
str_count(ch1, "\\b\\w+\\b")
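One caveat with this definition of a word: an apostrophe is not a word character, so a contraction is counted as two words. A small sketch on made-up sentences (not the chapter text):

```r
library(stringr)

word_pattern <- "\\b\\w+\\b"

n_plain       <- str_count("Here is a sentence", word_pattern)
n_contraction <- str_count("it doesn't match", word_pattern)

# Looking at the pieces shows where the extra "word" comes from
pieces <- str_extract_all("it doesn't match", word_pattern)[[1]]

print(n_plain)
print(n_contraction)
print(pieces)
```

For a rough word count this is usually acceptable, but keep it in mind when comparing against counts from other tools.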

In [101]:
str_view_all("Here is a sentence", "sentence")
str_count("Here is a sentence", "sentence")
str_count(ch1,"Harry")

### Exercise (Post your Answer in Piazza)
**Beginner** How many words are in Chapter 1?

**Advanced** How many words ending in `ing` are in Chapter 1?

## Grouping
In the previous exercise we found that "Professor" is one of the most common capitalized words. Is there a character named Professor, or is it just a title? Now let us try to match one or more capitalized words in a row. We can accomplish this by creating a *group*, and then applying a quantifier to it. 

To create a group, I surround a part of my regexp with parentheses:

In [102]:
str_view("this will be grouped", "[a-z]+ ?")
str_view("this will be grouped", "([a-z]+ ?)")

The parentheses do not change what the regular expression matches (they do something else as well, which we will discuss in the next lecture). But now I can apply a quantifier to the whole group:

In [103]:
str_view("this will be grouped", "([a-z]+ ?)+")
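The same idea applies to runs of capitalized words. The sketch below uses a made-up sentence and a slightly different pattern from the one used on the chapter: the space is placed at the start of the repeated group, so the match does not end in a trailing space.

```r
library(stringr)

s <- "she saw Professor McGonagall and Albus Dumbledore"

# One capitalized word at a time
single <- str_extract_all(s, "[A-Z]\\w+")[[1]]

# A capitalized word, followed by zero or more groups of (space + capitalized word)
runs <- str_extract_all(s, "[A-Z]\\w+( [A-Z]\\w+)*")[[1]]

print(single)
print(runs)
```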

In [106]:
str_extract_all(ch1, "([A-Z]\\w{4,} )+", 
                simplify=T) %>% fct_count %>% 
                top_n(10)

Selecting by n



| f `<fct>` | n `<int>` |
|---|---|
| Dudley | 7 |
| Dumbledore | 18 |
| Dursley | 38 |
| Dursleys | 5 |
| Hagrid | 7 |
| Harry | 9 |
| Muggles | 5 |
| Potters | 5 |
| Privet | 5 |
| Professor | 12 |


Earlier we looked at quotations. The first quotation in chapter 1 is:

In [107]:
str_sub(ch1, 2150, 2163)

How can we find other quotes? The pattern for a quote is a quotation mark, followed by any number of things that are not a quotation mark, followed by another quotation mark:

    <quotation mark><anything that is not a quotation mark><quotation mark>

To match this, we will use a *negation*. A negation is a character class that begins with the character `^`. It matches anything that is *not* inside the character class:

In [108]:
str_view_all("match doesn't match", "[^aeiou]+")

To match a quotation, we'll input the pattern that we specified above:

In [109]:
str_view_all('"Here is a quotation", said the professor. "And here is another."',
             '"[^"]+"')
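The quotation pattern can also be used with `str_extract_all`, `str_count` and `str_length` to pull out, count and measure quotes. This is a sketch on the example sentence only. It assumes straight double quotes (`"`); text that uses curly quotes would need a different character in the pattern.

```r
library(stringr)

text <- '"Here is a quotation", said the professor. "And here is another."'
quote_pattern <- '"[^"]+"'

quotes   <- str_extract_all(text, quote_pattern)[[1]]
n_quotes <- str_count(text, quote_pattern)

# The longest quote is the one with the most characters
longest <- quotes[which.max(str_length(quotes))]

print(quotes)
print(n_quotes)
print(longest)
```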

### Exercise (Post your Answer in Piazza)
**Beginner** How many quotations are there in ch1?

**Advanced** What is the longest quotation in the whole book?