# Lab 8: Regular Expressions and Strings

In [1]:
require(tidyverse)
require(stringr)

Loading required package: tidyverse

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.3     ✔ purrr   0.3.4
✔ tibble  3.1.0     ✔ dplyr   1.0.5
✔ tidyr   1.1.3     ✔ stringr 1.4.0
✔ readr   1.4.0     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



## Regular Expressions

Regular expressions (regex) are a way to describe patterns in text and are used to search for and match certain patterns in strings.

`Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.` - Jamie Zawinski

For instance, say that we want to find and extract all the email addresses in a document automatically. How might we do that?

### Special characters

Regex takes advantage of several reserved characters that are used for special functions. 

`. \ | ( ) [ ] ^ $ { } * + ?`

### Character classes

* `.` matches any single character (wildcard); by default it does not match a newline
* `[aeiou]` matches a single character in the set provided
* `[^aeiou]` matches a single character NOT in the set
* `[a-e]` matches a range, equivalent to `[abcde]`
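
As a rough illustration of the character classes above, here is a minimal sketch in R using `stringr`. The vector `sample_words` is a made-up example, not part of the lab data.

```r
library(stringr)

# Hypothetical sample words, chosen only for illustration
sample_words = c("apple", "sky", "bed")

first_vowel = str_extract(sample_words, "[aeiou]")    # "a" NA "e"  ("sky" has no a/e/i/o/u)
first_other = str_extract(sample_words, "[^aeiou]")   # "p" "s" "b"
has_b_any_d = str_detect(sample_words, "b.d")         # FALSE FALSE TRUE
in_range    = str_extract_all("badge", "[a-e]")[[1]]  # "b" "a" "d" "e"  ("g" is outside a-e)

print(first_vowel)
print(first_other)
print(has_b_any_d)
print(in_range)
```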

#### Shorthand

* `\w` matches a "word" character, equivalent to `[a-zA-Z0-9_]`
* `\s` matches any whitespace, including tabs and newlines
* `\d` matches digits, equivalent to `[0-9]`
* `\W`, `\S`, and `\D` match the opposite of the lower-case versions
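
The shorthand classes can be tried out in R as in the sketch below. The string `s` is an invented example. Note that in an R string each backslash is doubled, so the regex `\d` is typed as `"\\d"`.

```r
library(stringr)

# Hypothetical example string containing a space, a tab, and an underscore
s = "Room 101,\tBuilding B_2"

digits  = str_extract_all(s, "\\d+")[[1]]  # "101" "2"
words_s = str_extract_all(s, "\\w+")[[1]]  # "Room" "101" "Building" "B_2"
n_space = str_count(s, "\\s")              # 3 (space, tab, space)

print(digits)
print(words_s)
print(n_space)
```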

#### Special characters

* Note that `\t` and `\n` match the tab and newline characters. 
* If you want the "literal" versions of any of the reserved characters, you will need to escape them with a backslash `\`, e.g. `[\.\\\|]`
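
When a regex is written inside an R string, the backslash has to be escaped once more for R itself. The sketch below (with made-up test strings) shows the difference between an unescaped and an escaped `.`, and how to match a literal backslash.

```r
library(stringr)

# What the regex engine actually receives for the R string "\\."
writeLines("\\.")                                        # prints \.

wildcard_dot = str_detect(c("a.b", "axb"), "a.b")        # TRUE TRUE  (. matches any character)
literal_dot  = str_detect(c("a.b", "axb"), "a\\.b")      # TRUE FALSE (\. matches only a dot)

# The R string "a\\b" holds three characters: a, \, b.
# The regex \\ (typed "\\\\" in R) matches that single backslash.
has_backslash = str_detect("a\\b", "\\\\")               # TRUE

print(wildcard_dot)
print(literal_dot)
print(has_backslash)
```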


### Grouping

* `()` are used to group patterns together. A group can be combined with any of the operators below, and it can also be used to extract portions of a match individually, as we will see later.
* `\1`, `\2`, etc. refer to the text matched by the first, second, etc. group.
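
A small sketch of groups and backreferences in R follows. The inputs are invented for illustration; `str_match` returns the full match followed by one column per group.

```r
library(stringr)

# Extract the three groups of a date-like string (hypothetical input)
parts = str_match("2014-12-31", "(\\d{4})-(\\d{2})-(\\d{2})")
print(parts)   # full match, then "2014", "12", "31"

# \1 inside the pattern: the same character twice in a row
doubled = str_detect(c("banana", "apple", "pear"), "(.)\\1")          # FALSE TRUE FALSE

# \1 and \2 inside the replacement: swap the two captured words
swapped = str_replace("Doe, Jane", "(\\w+), (\\w+)", "\\2 \\1")       # "Jane Doe"

print(doubled)
print(swapped)
```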

### Operators

* `|` is the OR operator and allows matches of either side
* `{}` describes how many times the preceding character or group must occur:
  * `{m}` must occur exactly `m` times
  * `{m,n}` must occur between `m` and `n` times, inclusive
  * `{m,}` must occur at least `m` times
* `*` means the preceding character or group can appear zero or more times, equivalent to `{0,}`
* `+` means the preceding character or group must appear one or more times, equivalent to `{1,}`
* `?` means the preceding character or group can appear zero or one time, equivalent to `{0,1}`
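
The operators can be compared side by side on a few invented strings, as in the sketch below. The anchors `^` and `$` (see the Anchors section) are used here only to force the whole string to match.

```r
library(stringr)

# Hypothetical inputs with zero to three "a"s between "c" and "t"
x = c("ct", "cat", "caat", "caaat")

q_optional = str_detect(x, "^ca?t$")      # TRUE  TRUE  FALSE FALSE
q_star     = str_detect(x, "^ca*t$")      # TRUE  TRUE  TRUE  TRUE
q_plus     = str_detect(x, "^ca+t$")      # FALSE TRUE  TRUE  TRUE
q_exact    = str_detect(x, "^ca{2}t$")    # FALSE FALSE TRUE  FALSE
q_at_least = str_detect(x, "^ca{2,}t$")   # FALSE FALSE TRUE  TRUE
q_between  = str_detect(x, "^ca{1,2}t$")  # FALSE TRUE  TRUE  FALSE

# | matches either alternative
either = str_detect(c("cat", "dog", "cow"), "cat|dog")  # TRUE TRUE FALSE

print(rbind(q_optional, q_star, q_plus, q_exact, q_at_least, q_between))
print(either)
```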

### Anchors

* `^` matches the start of a string (or line)
* `$` matches the end of a string (or line)
* `\b` matches a word "boundary"
* `\B` matches any position that is not a word boundary
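
To see how the anchors behave, the sketch below searches for `cat` in a few made-up strings under each anchor.

```r
library(stringr)

# Hypothetical inputs: "cat" at the start, at the end, inside a word, and as a separate word
x = c("cat", "concat", "category", "a cat nap")

at_start   = str_detect(x, "^cat")        # TRUE  FALSE TRUE  FALSE
at_end     = str_detect(x, "cat$")        # TRUE  TRUE  FALSE FALSE
whole_word = str_detect(x, "\\bcat\\b")   # TRUE  FALSE FALSE TRUE
mid_word   = str_detect(x, "\\Bcat")      # FALSE TRUE  FALSE FALSE

print(rbind(at_start, at_end, whole_word, mid_word))
```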

### Examples

Go to https://regex101.com/ to play around with creating regex patterns. To start, copy-paste the following paragraph (from [The Ringer](https://www.theringer.com/mlb/2018/10/22/18008004/world-series-boston-red-sox-los-angeles-dodgers-mookie-betts-second-base-jd-martinez)) into the text field.

`According to Baseball Reference’s wins above average, The Red Sox had the best outfield in baseball— one-tenth of a win ahead of the Milwaukee Brewers, 11.5 to 11.4. And that’s despite, I’d argue, the two best position players in the NL this year (Christian Yelich and Lorenzo Cain) being Brewers outfielders. More importantly, the distance from Boston and Milwaukee to the third-place Yankees is about five wins. Two-thirds of the Los Angeles Angels’ outfield is Mike Trout (the best player in baseball) and Justin Upton (a four-time All-Star who hit 30 home runs and posted a 122 OPS+ and .348 wOba this year), and in order to get to 11.5 WAA, the Angels’ outfield would have had to replace right fielder Kole Calhoun with one of the three best outfielders in baseball this year by WAA.`

#### 1. Write a regex that captures all capitalized words.

`\b[A-Z][a-z]+`

#### 2. Write a regex that captures all the numbers.

`\.?\d+\.?\d*`

#### 3. Write a regex that captures all hyphenated words

`\w+-\w+`

#### 4. Write a regex that captures all words with two consecutive vowels

`\w*[aeiou]{2}\w*`

#### 5. Write a regex that captures all words with a repeated letter.

`\w*([a-zA-Z])\1\w*`

#### 6. Write a regex that matches `this` and `the` but not `third`.

`th(e|is)`
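
The same patterns can also be tried directly in R. The sketch below applies them to a short excerpt of the paragraph (and, for the last pattern, to an invented phrase) rather than the full text; remember that each backslash is doubled inside an R string.

```r
library(stringr)

# Short excerpt from the paragraph above
excerpt = "one-tenth of a win ahead of the Milwaukee Brewers, 11.5 to 11.4."

capitalized = str_extract_all(excerpt, "\\b[A-Z][a-z]+")[[1]]         # "Milwaukee" "Brewers"
numbers     = str_extract_all(excerpt, "\\.?\\d+\\.?\\d*")[[1]]       # "11.5" "11.4"
hyphenated  = str_extract_all(excerpt, "\\w+-\\w+")[[1]]              # "one-tenth"
two_vowels  = str_extract_all(excerpt, "\\w*[aeiou]{2}\\w*")[[1]]     # "ahead" "Milwaukee"
repeated    = str_extract_all(excerpt, "\\w*([a-zA-Z])\\1\\w*")[[1]]  # "Milwaukee"

# Hypothetical phrase for pattern 6
this_the = str_extract_all("this is the third", "th(e|is)")[[1]]      # "this" "the"

print(capitalized)
print(numbers)
print(hyphenated)
print(two_vowels)
print(repeated)
print(this_the)
```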


## Strings

In [2]:
string1 = "Michigan: BIG 10 Champion!!"
cat(string1)

Michigan: BIG 10 Champion!!

In [3]:
our_state = "Michigan"
ne_states = c("Connecticut", "Maine", "Massachusetts", "Vermont", "New Hampshire", "Rhode Island")
lakemich_states = c("Wisconsin", "Illinois", "Michigan", "Indiana")

In [4]:
our_state %in% ne_states
our_state %in% lakemich_states

Note that there are some special characters. The most commonly used ones are `\n` and `\t` for newlines and tabs, respectively.

Also note that some reserved characters do special things in strings. To include them literally, you must escape them with a backslash `\`.

In [5]:
double_quote = "hi\"bye"
backslash_ex = "a\\tb"
backslash_ex2 = "a\tb"

In [6]:
cat(double_quote)

hi"bye

In [7]:
cat(backslash_ex)

a\tb

In [8]:
cat(backslash_ex2)

a	b
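
One way to convince yourself of the difference between the two strings is to count their characters. The sketch below also contrasts `print()`, which shows the escaped form, with `writeLines()`, which shows the raw contents.

```r
library(stringr)

len_escaped = str_length("a\\tb")  # 4 characters: a, \, t, b
len_tab     = str_length("a\tb")   # 3 characters: a, tab, b

print(len_escaped)
print(len_tab)

print("a\\tb")       # shows the escaped form: "a\\tb"
writeLines("a\\tb")  # shows the raw contents: a\tb
```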

You’ll also sometimes see strings like `"\u00b5"` ($\mu$). This is called Unicode escaping, and it is a way of writing non-ASCII characters that works on all platforms.

In [9]:
cat("\u00b5")

µ

In [10]:
cat("\u00e7 (c-cedilla) is a Latin script letter, used in the Albanian, Azerbaijani, Manx, Tatar, Turkish, Turkmen, Kurdish, Zazaki, and Romance alphabets." )


ç (c-cedilla) is a Latin script letter, used in the Albanian, Azerbaijani, Manx, Tatar, Turkish, Turkmen, Kurdish, Zazaki, and Romance alphabets.

In [11]:
cat("You can even use emojis like: \U0001f637")

You can even use emojis like: 😷

### String Functions

In [12]:
ne_states

In [13]:
str_length(ne_states)

In [14]:
cat(str_c('Istanbul', 'Turkey\n', sep=', '))
cat(str_c('Ann Arbor', 'MI', "USA", sep=', '))

Istanbul, Turkey
Ann Arbor, MI, USA

In [15]:
x = c('abc', '123', NA)

In [16]:
str_c('|-', x, '-|')

In [17]:
x

In [18]:
#str_replace_na(x, "None")

In [19]:
str_c('|-', str_replace_na(x), '-|') # str_replace_na() turns NA into the string "NA" before joining

To collapse a vector of strings, use the `collapse` argument to `str_c`:

In [20]:
ne_states

In [21]:
str_c(ne_states, collapse=", ")

### Subsetting Strings

In [22]:
ne_states = c("Connecticut", "Maine", "Massachusetts", "Vermont", "New Hampshire", "Rhode Island")
ne_states

In [23]:
str_sub(ne_states, 1, 3)

In [24]:
str_sub(ne_states, -3, -1)

In [25]:
str_sub(ne_states, 1, 7)  # notice that this didn't fail for Maine

In [26]:
str_sub(ne_states, 1, 1) = str_to_lower(str_sub(ne_states, 1, 1))
ne_states

In [27]:
str_sub(ne_states, -3, -1) = str_to_upper(str_sub(ne_states, -3, -1))
ne_states

## Using regular expressions in R

In `R`, we will use `str_view` and `str_view_all` to play with regular expressions. 

Note that other functions you've used previously, such as `str_detect` and `str_replace`, can also take regular expressions as patterns.

`str_view` and `str_view_all` take a string (or a vector of strings) and show you the matches to a pattern.

In [59]:
x = c("apple", "banana", "pear", "orange")

In [60]:
str_view(x, "an")

In [30]:
str_view_all(x, "an")

In [61]:
str_extract(x, "an")

In [62]:
str_extract_all(x, "an")
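
As noted above, functions such as `str_detect` and `str_replace` accept regular expressions too. A brief sketch on the same vector `x`:

```r
library(stringr)

x = c("apple", "banana", "pear", "orange")

has_an        = str_detect(x, "an")             # FALSE TRUE FALSE TRUE
starts_vowel  = str_subset(x, "^[aeiou]")       # "apple" "orange"
first_vowel_x = str_replace(x, "[aeiou]", "_")  # "_pple" "b_nana" "p_ar" "_range"
all_an_upper  = str_replace_all(x, "an", "AN")  # "apple" "bANANa" "pear" "orANge"

print(has_an)
print(starts_vowel)
print(first_vowel_x)
print(all_an_upper)
```

`str_replace` changes only the first match in each string, while `str_replace_all` changes every match.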

## Exercises

Use `stringr::words` to do the exercises.

If `str_view` does not work for you, you can use the function `str_view2` defined below for these exercises.

In [44]:
str_view2 = function(x, reg){
    # a function that mimics str_view in a non-HTML environment
    z = str_extract_all(x, reg)
    cat(str_c(map_chr(z, ~ifelse(length(.) != 0, str_c(.,'\n'), "")), collapse = ""))
}

### 1 Start with `y` (I've done this one for you)

In [45]:
str_view2(words, "^y\\w*")

year
yes
yesterday
yet
you
young


In [46]:
str_view(words, "^y\\w*", match = TRUE)

### 2  End with `x`

### 3 Are exactly two letters long (don’t cheat by using `str_length`!)

### 4 Have ten letters or more

### 5 End with `ed`, but not with `eed`

### 6 End with `ing` or `ise`

### 7 End with the same two-letter sequence they start with (e.g. `church`)

### 8 Empirically verify the rule "i before e except after c" (use multiple patterns to check this)

### 9 Match palindromes of length 5

### 10 Try to match the valid `dates` below (first row) without matching the invalid dates (the rest).
Hint: Start by writing a pattern that matches all the entries. Then try to refine your pattern to omit the invalid dates.

In [56]:
dates = c('2012-05-13', '2014-12-31', '1991-06-14', '1991/06/14',
          '200a-05-13',  # invalid year
          '2014-15-20',  # invalid month
          '2014-00-20',  # invalid month
          '2016-04-35',  # invalid day
          '2014-12-00',  # invalid day
          '2013/03-25')  # non-matching separators

# str_view(dates, 'pattern')
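
Following the hint, a possible first step (not a solution) is a deliberately loose pattern that matches every entry. The name `loose` and the pattern itself are just one assumed starting point; the refinement is left as the exercise.

```r
library(stringr)

dates = c('2012-05-13', '2014-12-31', '1991-06-14', '1991/06/14',
          '200a-05-13', '2014-15-20', '2014-00-20', '2016-04-35',
          '2014-12-00', '2013/03-25')

# Loose starting point: four word characters, a separator, two digits,
# a separator, two digits. It accepts the invalid entries as well.
loose = "^\\w{4}[-/]\\d{2}[-/]\\d{2}$"

matches_all = all(str_detect(dates, loose))
print(matches_all)  # TRUE
```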

## Further REGEX examples

### EX 1
```
email@domain.com
firstname.lastname@domain.com
email@subdomain.domain.com
firstname+lastname@domain.com
email@123.123.123.123
1234567890@domain.com
email@domain-one.com
_______@domain.com
email@domain.name
email@domain.co.jp
firstname-lastname@domain.com

plainaddress
#@%^%#$@#$@#.com
@domain.com
Joe Smith <email@domain.com>
email.domain.com
email@domain@domain.com
.email@domain.com
email@domain.com (Joe Smith)
email@domain
email@-domain.com
email@domain..com
```

#### Write a regex that matches all the valid emails (the first group) but none of the invalid ones (the second group)

`^\w+[\w+.-]+@\w+[\w-]+(\.[\w-]+)+$`
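
The pattern can be checked in R against both lists. The sketch below is only a test harness: the vector names `valid` and `invalid` are made up here, and the pattern is the one given above with its backslashes doubled for R.

```r
library(stringr)

valid = c("email@domain.com", "firstname.lastname@domain.com",
          "email@subdomain.domain.com", "firstname+lastname@domain.com",
          "email@123.123.123.123", "1234567890@domain.com",
          "email@domain-one.com", "_______@domain.com", "email@domain.name",
          "email@domain.co.jp", "firstname-lastname@domain.com")

invalid = c("plainaddress", "#@%^%#$@#$@#.com", "@domain.com",
            "Joe Smith <email@domain.com>", "email.domain.com",
            "email@domain@domain.com", ".email@domain.com",
            "email@domain.com (Joe Smith)", "email@domain",
            "email@-domain.com", "email@domain..com")

email_pattern = "^\\w+[\\w+.-]+@\\w+[\\w-]+(\\.[\\w-]+)+$"

all_valid_match  = all(str_detect(valid, email_pattern))
no_invalid_match = !any(str_detect(invalid, email_pattern))

print(all_valid_match)
print(no_invalid_match)
```

This only shows that the pattern separates these two lists; it is not a general-purpose email validator.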


### EX 2


```
12
1048
3.14529
0.87
-255.34
123,340.00
-16,123,340
1.9e10
-5.8e5
1.45e-5

720p
164.
.87
124..43
153.243.232
123,,546
24.256,453
123,34,123
,253
12.4e6
```

#### Write a regex that matches all the valid numbers above (the first group) but none of the invalid ones (the second group).

`^-?(\d+(,\d{3})*(\.\d+)?|\d\.\d+e-?\d+)$`
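
A candidate answer can be checked in R against both lists, as sketched below. The pattern used here, `^-?(\d+(,\d{3})*(\.\d+)?|\d\.\d+e-?\d+)$`, allows an optional leading minus sign before either alternative (plain or comma-grouped numbers, or scientific notation with a single leading digit). Treat it as one possible answer; the vector names are invented for this check.

```r
library(stringr)

valid_nums = c("12", "1048", "3.14529", "0.87", "-255.34", "123,340.00",
               "-16,123,340", "1.9e10", "-5.8e5", "1.45e-5")

invalid_nums = c("720p", "164.", ".87", "124..43", "153.243.232",
                 "123,,546", "24.256,453", "123,34,123", ",253", "12.4e6")

# Backslashes are doubled because the regex is written as an R string
num_pattern = "^-?(\\d+(,\\d{3})*(\\.\\d+)?|\\d\\.\\d+e-?\\d+)$"

valid_ok   = str_detect(valid_nums, num_pattern)
invalid_ok = !str_detect(invalid_nums, num_pattern)

print(all(valid_ok))    # TRUE
print(all(invalid_ok))  # TRUE
```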