# Lab 6 Strings and Vectors
## Strings

In [3]:
require(tidyverse)
require(stringr)
install.packages('htmlwidgets')
require(htmlwidgets)

Loading required package: tidyverse
Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ---------------------------------------------------
filter(): dplyr, stats
lag():    dplyr, stats
Loading required package: stringr
Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
Loading required package: htmlwidgets


In [2]:
string1 = "Michigan: BIG 10 Champion!!"

In [3]:
our_state = "Michigan"

In [4]:
ne_states = c("Connecticut", "Maine", "Massachusetts", "Vermont", "New Hampshire", "Rhode Island")

In [5]:
our_state %in% ne_states

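To include a literal double quote in a double-quoted string, escape it with a backslash. Inside single quotes, as below, the backslash is optional but harmless: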

In [9]:
double_quote = ' \" '

In [10]:
print(double_quote)

[1] " \" "


In [11]:
double_quote

There are a handful of other special characters. The most common are "\n" (newline) and "\t" (tab); you can see the complete list by requesting help on quotes with ?'"' or ?"'". You’ll also sometimes see strings like "\u00b5", which is a Unicode escape: a way of writing non-English characters that works on all platforms.

In [12]:
x <- "\u00b5"
x
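The difference between how a string is printed and what it actually contains is easy to miss. The sketch below (the string `escaped` is made up for illustration) compares print(), which shows the escaped form, with writeLines(), which shows the raw contents:

```r
# A made-up string containing an escaped quote, a tab, a newline and a Unicode escape.
escaped <- "She said \"hi\"\tand left.\nNext line: \u00b5"

# print() shows the string as you would have to type it, escapes included.
print(escaped)

# writeLines() shows the raw contents: a real quote, a real tab, a real line break.
writeLines(escaped)

# Each escape sequence stands for a single character.
n_quote <- nchar("\"")
n_tab   <- nchar("\t")
```

Both n_quote and n_tab are 1, because the backslash is part of the notation rather than part of the string.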

### String Functions

In [15]:
str_length(ne_states)

In [16]:
str_c('Seoul', 'Korea', sep=', ')

In [17]:
x = c('abc', NA)
print(x)

[1] "abc" NA   


In [18]:
str_c('|-', x, '-|')

In [19]:
str_c('|-', str_replace_na(x), '-|')

To collapse a vector of strings into a single string, use the *collapse* argument:

In [22]:
str_c(c("Boston", "New Haven", "Burlington"), collapse = ", ")
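It can help to see sep and collapse side by side. This is a small sketch with made-up vectors (`cities` and `states` are illustrative names): sep works element by element across the arguments, while collapse reduces the result to one string.

```r
library(stringr)

cities <- c("Boston", "New Haven", "Burlington")
states <- c("MA", "CT", "VT")

# sep joins the arguments element by element, so the result still has length 3.
paired <- str_c(cities, states, sep = ", ")

# collapse then joins the elements of that vector into a single string.
one_line <- str_c(paired, collapse = "; ")
one_line
```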

### Subsetting Strings

In [23]:
ne_states

In [24]:
str_sub(ne_states, 1, 3)

In [25]:
str_sub(ne_states, -3, -1)

In [26]:
str_sub(ne_states, 1, 7) # notice that this didn't fail for Maine

In [27]:
str_sub(ne_states, 1, 1) <- str_to_lower(str_sub(ne_states, 1, 1))

In [28]:
str_sub(ne_states, 1, 1) <- str_to_upper(str_sub(ne_states, 1, 1))

#### Locales
Turkish has two i's: with and without a dot, and it has a different rule for capitalising them:

In [29]:
str_to_upper(c("i", "ı"))

In [30]:
str_to_upper(c("i", "ı"), locale = "tr")

The locale is specified as an ISO 639 language code, which is a two- or three-letter abbreviation. If you don't already know the code for your language, look it up on Wikipedia.

Another important operation that’s affected by the locale is sorting. The base R order() and sort() functions sort strings using the current locale. If you want robust behaviour across different computers, you may want to use str_sort() and str_order(), which take an additional locale argument:

In [3]:
x = c("apple", "eggplant", "banana")

str_sort(x, locale = "en")  # English

str_sort(x, locale = "haw") # Hawaiian

## Matching Patterns with regular expressions
To learn regular expressions, we’ll use *str_view()* and *str_view_all()*. These functions take a character vector and a regular expression, and show you how they match. We’ll start with very simple regular expressions and then gradually get more and more complicated. Once you’ve mastered pattern matching, you’ll learn how to apply those ideas with various stringr functions.

#### Exact Matching

In [42]:
x = c("apple", "banana", "pear")
str_view(x, "an")

#### Matching any character except a newline


In [44]:
str_view(x, ".a.")

But if “.” matches any character, how do you match the character “.”? You need to use an “escape” to tell the regular expression you want to match it exactly, not use its special behaviour. Like strings, regexps use the backslash, \, to escape special behaviour. So to match a literal ., you need the regexp \. (a backslash followed by a dot). Unfortunately this creates a problem: we use strings to represent regular expressions, and \ is also used as an escape symbol in strings. So to create the regular expression \. we need the string "\\.".

In [4]:
# To create the regular expression, we need \\
dot = "\\."

# But the expression itself only contains one:
writeLines(dot)

# And this tells R to look for an explicit .
str_view(c("abc", "a.c", "bef"), "a\\.c")

\.




If \ is used as an escape character in regular expressions, how do you match a literal \? Well you need to escape it, creating the regular expression \\. To create that regular expression, you need to use a string, which also needs to escape \. That means to match a literal \ you need to write "\\\\" — you need four backslashes to match one!

In [46]:
x = "a\\b"
writeLines(x)

str_view(x, "\\\\")

a\b
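If str_view() is not available (it needs the htmlwidgets package to render), the same escaping rules can be checked with str_detect(), which returns plain TRUE/FALSE values. This is only a sketch; the vector `candidates` is made up for illustration:

```r
library(stringr)

candidates <- c("abc", "a.c", "bef")

# Unescaped: . matches any character, so both "abc" and "a.c" match.
unescaped <- str_detect(candidates, "a.c")

# Escaped: the regexp \. matches only a literal dot.
escaped <- str_detect(candidates, "a\\.c")

# fixed() treats the whole pattern as literal text, so no escaping is needed.
literal <- str_detect(candidates, fixed("a.c"))

# Four backslashes in the string give the regexp \\, which matches one backslash.
has_backslash <- str_detect(c("a\\b", "ab"), "\\\\")
```

Here unescaped is TRUE TRUE FALSE, while escaped and literal are both FALSE TRUE FALSE. has_backslash is TRUE FALSE.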


#### Anchors

It's often useful to anchor the regular expression so that it matches from the start or the end of a string. You can use:
1. ^ to match the start of a string
2. $ to match the end of a string

In [47]:
x = c("apple", "banana", "pear")
str_view(x, "^a")

In [48]:
str_view(x, "a$")

To force a regular expression to only match a complete string, anchor it with both ^ and $.

In [49]:
x = c("apple pie", "apple", "apple cake")
str_view(x, "apple")

In [56]:
str_view(x, "^apple$")
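The effect of anchoring on both sides can also be checked without str_view(). A minimal sketch using str_detect() and str_subset() on the same vector (the names `anywhere` and `whole_only` are just illustrative):

```r
library(stringr)

x <- c("apple pie", "apple", "apple cake")

# Without anchors the pattern may match anywhere, so every element matches.
anywhere <- str_detect(x, "apple")

# With both anchors the pattern has to account for the complete string.
whole_only <- str_detect(x, "^apple$")

# str_subset() keeps only the elements that match.
exact_matches <- str_subset(x, "^apple$")
```

anywhere is TRUE for all three strings, whole_only is TRUE only for "apple", and exact_matches contains just "apple".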

### Exercises
Given the corpus of common words in stringr::words, create regular expressions that find all words that:

- Start with “y”.
- End with “x”.
- Are exactly three letters long. (Don’t cheat by using str_length()!)
- Have seven letters or more. Since this list is long, you might want to use the match argument to str_view() to show only the matching or non-matching words.

In [7]:
words = stringr::words
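# Start with "y"
str_view(words, "^y", match = TRUE)
# End with "x"
str_view(words, "x$", match = TRUE)
# Exactly three letters long
str_view(words, "^[a-z]{3}$", match = TRUE)
# Seven letters or more: the first seven characters are letters, anything may follow
str_view(words, "^[a-z]{7}", match = TRUE)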

#### Character classes and alternatives
We've already seen that the period character matches every character except the new line character. The following are also extremely useful:

- \d: matches any digit.
- \s: matches any whitespace (e.g. space, tab, newline).
- [abc]: matches a, b, or c.
- [^abc]: matches anything except a, b, or c.

Remember, to create a regular expression containing \d or \s, you’ll need to escape the \ for the string, so you’ll type "\\d" or "\\s".

You can use alternation to pick between one or more alternative patterns. For example, abc|d..f will match either “abc” or "deaf". Note that | has low precedence, so abc|xyz means "abc" or "xyz"; it does not mean ab(c|x)yz, which would match "abcyz" or "abxyz". Like with mathematical expressions, if precedence ever gets confusing, use parentheses to make it clear what you want:

In [8]:
str_view(c("grey", "gray"), "gr(e|a)y")

In [52]:
str_view(c("abc", "xyz", "abcyz", "abxyz"), "abc|xyz")
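The character classes and the precedence of | can be tried out with str_detect() and str_extract(). The following is a sketch; the vector `samples` is made up for illustration:

```r
library(stringr)

samples <- c("room 101", "no digits", "tab\there", "grey")

# \d matches a digit; only "room 101" contains one.
has_digit <- str_detect(samples, "\\d")

# \s matches whitespace; a space and a tab both count.
has_space <- str_detect(samples, "\\s")

# [^abc] matches the first character that is not a, b, or c.
first_other <- str_extract("cabbage", "[^abc]")

# Parentheses decide what the alternation covers.
inputs <- c("abc", "xyz", "abcyz", "abxyz")
whole_alt   <- str_detect(inputs, "^(abc|xyz)$")   # exactly "abc" or exactly "xyz"
grouped_alt <- str_detect(inputs, "^ab(c|x)yz$")   # "abcyz" or "abxyz"
```

whole_alt is TRUE only for "abc" and "xyz", while grouped_alt is TRUE only for "abcyz" and "abxyz".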

### Exercises


- Create regular expressions to find all words that:

    - Start with a vowel.

    - That only contain consonants. (Hint: thinking about matching “not”-vowels.)

    - End with ed, but not with eed.

    - End with ing or ise.

- Empirically verify the rule “i before e except after c”.

- Is “q” always followed by a “u”?

- Write a regular expression that matches a word if it’s probably written in British English, not American English.

- Create a regular expression that will match telephone numbers as commonly written in your country.

In [None]:
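# Start with a vowel (no commas inside the character class)
str_view(words, "^[aeiou]", match = TRUE)
# Contain only consonants: every character from start to end is a non-vowel
str_view(words, "^[^aeiou]+$", match = TRUE)
# End with "ed" but not "eed"
str_view(words, "(^|[^e])ed$", match = TRUE)
# End with "ing" or "ise"
str_view(words, "(ing|ise)$", match = TRUE)
# Any "q" followed by something other than "u"
str_view(words, "q[^u]", match = TRUE)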
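For the “i before e except after c” exercise, one possible approach (a sketch, not the only way to do it) is to collect the words that follow the rule and the words that break it, then compare the counts. The names `follows_rule` and `breaks_rule` are illustrative:

```r
library(stringr)

words <- stringr::words

# Follows the rule: "ei" directly after c, or "ie" after anything other than c.
follows_rule <- str_subset(words, "cei|[^c]ie")

# Breaks the rule: "ie" directly after c, or "ei" after anything other than c.
breaks_rule <- str_subset(words, "cie|[^c]ei")

length(follows_rule)
length(breaks_rule)
breaks_rule
```

If far more words follow the rule than break it, the rule holds up reasonably well for this corpus. Note that [^c] requires a preceding character, so a word that begins with "ie" or "ei" would be missed by these patterns.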

## Exercise 1

Write a single regular expression that matches all of the following strings:
- regular expression
- regular expressions
- regex
- regexp
- regexes


In [8]:
text <- c("regular expression",
          "regular expressions",
          "regex",
          "regexp",
          "regexes")

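One possible pattern is sketched below, assuming the goal is to match each string completely and nothing longer. It is not the only answer; the variable name `pattern` is illustrative. Try the exercise yourself before reading it.

```r
library(stringr)

text <- c("regular expression",
          "regular expressions",
          "regex",
          "regexp",
          "regexes")

# "reg" followed by either "ular expression" with an optional "s",
# or "ex" with an optional "p" or "es". The anchors force a complete match.
pattern <- "^reg(ular expressions?|ex(p|es)?)$"

all_match <- all(str_detect(text, pattern))

# A near miss should not match.
near_miss <- str_detect("regexps", pattern)
```

all_match is TRUE and near_miss is FALSE.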
## Vectors and Datatypes

Vectors: https://rawgit.com/byoungwookjang/stats406_f17_labs/master/lab1/Stats406Lab1.html

Functions and Loops: https://rawgit.com/byoungwookjang/stats406_f17_labs/master/lab2/Stats406Lab2.html

More strings: https://rawgit.com/byoungwookjang/stats406_f17_labs/master/lab7/Stats406Lab7.html