# Manipulating Strings
Strings are far more varied than numbers. In this exercise you will explore the fundamental string functions you can use for data wrangling. You will use stringr, a tidyverse package. Older tidyverse releases did not attach stringr with library(tidyverse), so this exercise loads it with its own library() statement; in recent releases the extra call is harmless.

The stringr package contains many functions that detect string patterns and modify strings. The workhorse of string detection is regular expressions. Pragmatically, regular expressions are strings that use wildcards for matching. There are many different symbols that work together to implement the string pattern for both matching and replacement. 


## R Features
* library()
* print()
* cat()
* str_length()
* str_c()
* c()
* str_replace_na()
* str_sub()
* str_to_lower()
* str_to_upper()
* str_to_title()
* str_trim()
* str_detect()
* str_extract()
* str_replace()
* glimpse()
* mutate()
* levels()
* filter()
* select()
* distinct()
* arrange()
* head()
* as.character()
* as.factor()

## Datasets
* mpg


In [None]:
# Load libraries
___(___)    # stringr
___(___)  # tidyverse


## String literals and variables
When you use string functions you can pass in a string literal (a stand alone string) or a variable that contains the string data. Let's get familiar with how to create string literals and save them in variables. 

For diagnostic purposes, you can enclose a line of code in parentheses '(< code >)' and it will display the result.

The code below is complete and provided for your reference, because these ideas are easier to show than to describe. Run it and study the output.

In [None]:
# String literal
"This is a string literal"

# A three letter string
str_three <- "abc"

# Strings can use single or double quotes
str_single <- 'This is a string using single quotes'
str_double <- "This is a string using double quotes"

# Choose single or double quotes with embedded quote characters
str_single_embedded <- 'This is a string using single quotes with "embedded" double quotes'
str_double_embedded <- "This is a string using double quotes with 'embedded' single quotes"

# Escaping a double quote inside a double-quoted string using a backslash
str_double_embedded_escaped <- "This is a string using double quotes with \"embedded\" double quotes using escape"

# Use print to print these
print(str_single)
print(str_double)
print(str_single_embedded)
print(str_double_embedded)
print(str_double_embedded_escaped)

# Assign to variable and print using parentheses
(str_assign_print <- "Assigning string and printing using parentheses")

# multi-line string
str_multi_line <- "Line 1
Line 2
Line 3"

# Print displays newline as \n
# cat displays actual newline
print(str_multi_line)
cat(str_multi_line)
cat("\n") # Needed to reset to new line for next output

# Storing multiple strings in one variable
# known as a character vector
# Can store literals, variables, and NA
str_vector <- c("string 1", str_single_embedded, NA, str_multi_line)

# cat output
cat(str_vector)
cat("\n") # Needed to reset to new line for next output

# print output easier to see the vector
print(str_vector)

## Stringr package

The stringr help index groups its functions into the topics listed below. The exact set depends on the installed version.

* case                    Convert case of a string.

* invert_match            Switch location of matches to location of non-matches.
                        
* modifiers               Control matching behaviour with modifier functions.
                        
* str_c                   Join multiple strings into a single string.

* str_conv                Specify the encoding of a string.

* str_count               Count the number of matches in a string.

* str_detect              Detect the presence or absence of a pattern in a string.
                        
* str_dup                 Duplicate and concatenate strings within a character vector.
                        
* str_extract             Extract matching patterns from a string.

* str_interp              String interpolation.

* str_length              The length of a string.

* str_locate              Locate the position of patterns in a string.

* str_match               Extract matched groups from a string.

* str_order               Order or sort a character vector.

* str_pad                 Pad a string.

* str_replace             Replace matched patterns in a string.

* str_replace_na          Turn NA into "NA"

* str_split               Split up a string into pieces.

* str_sub                 Extract and replace substrings from a character vector.
                        
* str_subset              Keep strings matching a pattern.

* str_trim                Trim whitespace from start and end of string.

* str_trunc               Truncate a character string.

* str_view                View HTML rendering of regular expression match.
                        
* str_wrap                Wrap strings into nicely formatted paragraphs.

* stringr-data            Sample character vectors for practicing string manipulations.
                        
* word                    Extract words from a sentence.


In [None]:
# Use library(help = "package")
# to see the functions
library(help = "___")

## str_length()
Returns the length of a string. Technically this is the number of "code points" in the string. One code point usually corresponds to one character, but not always. For example, a u with an umlaut might be represented as a single character or as the combination of a u and an umlaut.
### Usage
str_length(string)

In [None]:
# View help on str_length()
?___

In [None]:
# What is the string length of 
# str_three
print(str_three)
___(str_three)

# What is the string length of 
# str_multi_line
print(str_multi_line)
___(str_multi_line)

# What is the string length of 
# str_vector
print(str_vector)
___(str_vector)

# What is the string length of 
# letters
print(letters)
___(letters)

Are the string lengths what you expected? 

For the most part, str_length() counts characters. Keep in mind that the function is vectorized: when you pass it a character vector, such as letters, it returns one length per element rather than a single total.

Notice that a newline, although printed as \n, is counted as a single character.
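The points above can be checked on a small vector. This is a minimal sketch; `example_strings` is a made-up vector that is not part of the exercise.

```r
library(stringr)

# Hypothetical vector: plain string, string with a newline, NA, empty string
example_strings <- c("abc", "a\nb", NA, "")

# str_length() is vectorized: it returns one length per element
string_lengths <- str_length(example_strings)
print(string_lengths)
# The newline counts as one character, NA stays NA, "" has length 0

# length() answers a different question: how many elements the vector holds
element_count <- length(example_strings)
print(element_count)
```

Comparing str_length() with length() is a quick way to confirm whether you are measuring the strings or the vector that holds them.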

## str_c()
Join multiple strings into a single string. To understand how str_c works, you need to imagine that you are building up a matrix of strings. Each input argument forms a column, and is expanded to the length of the longest argument, using the usual recycling rules. The sep string is inserted between each column. If collapse is NULL each row is collapsed into a single string. If non-NULL that string is inserted at the end of each row, and the entire matrix collapsed to a single string.
### Usage
str_c(..., sep = "", collapse = NULL)

In [None]:
# View help on str_c()
?___

In [None]:
# Combine two string literals
# print result and length
str_combine_literals <- str_c("first", "second")
print(str_combine_literals)
___(str_combine_literals)

# Combine a literal, a single string variable, and a character vector
# print result and length
str_combine_alltypes <- ___("literal", str_three, str_vector)
print(str_combine_alltypes)
___(str_combine_alltypes)

# Combine literals and character vector str_vector
# print result and length
str_combine_alltypes <- ___("literal2", str_vector)
print(str_combine_alltypes)
___(str_combine_alltypes)

# Combine literals and character vector letters
# print result and length
str_combine_letters <- ___("Letter: ", letters)
print(str_combine_letters)
___(str_combine_letters)


## Recycling
Combining strings reveals an interesting property of R called recycling. This isn't referring to memory garbage collection. It is the replication of data in one vector to match the length of data in another vector. This is part of the vectorization built into R.

When combining the string literal "Letter: " with the letters character vector of length 26, the str_c() function used the string literal "Letter: " 26 times (the original plus 25 copies) so it could be matched up with each entry of the letters character vector.

In [None]:
# Combine two character vectors of same size
str_1_2 <- c("1", "2")
str_a_b <- c("a", "b")
str_c(___, ___)

# Combine two character vectors where the longer length is a whole multiple of the shorter
___(str_1_2, letters)

# Combine two character vectors where the longer length is not a whole multiple of the shorter
str_1_2_3 <- c("1", "2", "3")
___(str_1_2_3, letters)

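* When the lengths of both character vectors are the same, the elements simply match up one to one.
* When one character vector is shorter than the other and the longer length is a whole multiple of the shorter, the shorter vector is recycled (replicated) to match the longer one.
* When the longer length isn't a whole multiple of the shorter, it still recycles, but also issues the warning message "longer object length is not a multiple of shorter object length".
* The behavior above depends on the installed version. As far as I know, stringr 1.5.0 and later follow the stricter tidyverse recycling rules, which recycle only length-1 vectors, so in those versions the last two calls stop with an error instead.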
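A minimal sketch of the simplest case, recycling a length-1 value, which to my knowledge behaves the same in every stringr version. The names `prefix` and `few_letters` are made up for this illustration.

```r
library(stringr)

prefix <- "Letter: "              # length 1
few_letters <- c("a", "b", "c")   # length 3

# The length-1 prefix is recycled so it pairs with every element
labelled <- str_c(prefix, few_letters)
print(labelled)

# The same call with the recycling written out by hand using rep()
explicit <- str_c(rep(prefix, length(few_letters)), few_letters)
print(identical(labelled, explicit))
```

Writing the rep() call explicitly is a reasonable habit when the vectors have different lengths greater than one, because it makes the intended pairing visible and avoids relying on version-specific recycling.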

## Collapsing and separating strings
str_c() can also take multiple strings and character vectors and collapse them into a single string.
* sep value is put between each of the string vectors
* collapse value is put between each set of string vectors and collapses the result into a single string

In [None]:
# Collapse letters character vector into a single string
# No separator between strings
# Hint: collapse=""
str_c(letters, ___)

# Collapse letters character vector into a single string
# Separate using a comma and a space
# Hint: collapse=", "
str_c(letters, ___)

# Combine str_1_2 and letters and str_a_b character vectors into a single string
# separating each combined string by a colon 
# separating each collapsed string by a comma and space
# Result: '1:a:a, 2:b:b, 1:c:a, ...'
# Hint: sep=":", collapse=", "
str_c(str_1_2, letters, str_a_b, ___, ___)

There is a lot of flexibility in how to combine, separate, and collapse strings.
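The difference between sep and collapse is easiest to see side by side. This is a small sketch using two made-up vectors, `first` and `second`, of equal length.

```r
library(stringr)

first  <- c("x", "y", "z")
second <- c("1", "2", "3")

# sep is placed between the arguments, element by element.
# The result still has three strings: "x-1" "y-2" "z-3"
pairs <- str_c(first, second, sep = "-")
print(pairs)

# collapse then joins those three results into one string: "x-1 | y-2 | z-3"
joined <- str_c(first, second, sep = "-", collapse = " | ")
print(joined)
```

A useful check is the length of the result: without collapse it matches the longest input, and with collapse it is always 1.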

## str_replace_na()
Turn NA into "NA". 
### Usage
str_replace_na(string, replacement = "NA")
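As a short illustration of why this function exists, the sketch below uses a made-up vector `readings` that contains a missing value.

```r
library(stringr)

readings <- c("low", NA, "high")

# NA is contagious: joining anything with NA gives NA
raw_join <- str_c("level: ", readings)
print(raw_join)

# Convert NA to the text "NA" first so the join keeps every element
with_default <- str_c("level: ", str_replace_na(readings))
print(with_default)

# The replacement argument chooses different text
with_custom <- str_c("level: ", str_replace_na(readings, replacement = "unknown"))
print(with_custom)
```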

In [None]:
# View help on str_replace_na()
?___

In [None]:
# Missing inputs give missing outputs
str_c("something", NA)

# Combining character vectors containing NA
# Combine str_contains_na with letters
str_contains_na <- c("-", NA)
___(___, ___)

# Use str_replace_na before str_c to display NA
# Convert NA in str_contains_na to "NA"
# Combine above with letters
str_c(___(___), letters)

## str_sub()
Extract and replace substrings from a character vector. str_sub will recycle all arguments to be the same length as the longest argument. If any arguments are of length 0, the output will be a zero length character vector.
### Usage
str_sub(string, start = 1L, end = -1L)
str_sub(string, start = 1L, end = -1L) <- value

In [None]:
# View help on str_sub()
?___

In [None]:
# Define a string to work with
str_fox_dog <- "The crazy brown fox jumped over the lazy dog."

# Display the first 9 letters
# of str_fox_dog
# use all positional arguments
# no named parameters
___(str_fox_dog, ___, ___)

# Pipe in string using %>% 
# use named parameter 'end'
str_fox_dog %>% str_sub(end = 9)

# Display the last 9 letters
# of str_fox_dog
str_fox_dog %>% ___(___)

# Display just dog using str_sub
# of str_fox_dog
str_fox_dog %>% ___(start = ___, end = ___)

# In str_fox_dog replace dog with cats
# Hint: Pipe doesn't work in this case
___(str_fox_dog, start = ___, end = ___) <- "___"

# Print str_fox_dog
print(___)

If you know the starting and ending locations in a string relative to either the start or the end of the string, you can extract the substring and even replace it.
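Here is a minimal sketch of positive positions, negative positions, and replacement, using a made-up string `sentence` rather than the exercise string.

```r
library(stringr)

sentence <- "stringr makes strings easier"

# Positive positions count from the start: characters 1 to 7
first_word <- str_sub(sentence, 1, 7)
print(first_word)    # "stringr"

# Negative positions count from the end: the last 6 characters
last_word <- str_sub(sentence, start = -6)
print(last_word)     # "easier"

# The assignment form replaces that range in place.
# The replacement may be longer or shorter than the range.
str_sub(sentence, start = -6) <- "simpler"
print(sentence)      # "stringr makes strings simpler"
```

The assignment form modifies the variable named on the left, which is why it cannot be used at the end of a pipe.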

## Substrings of character vectors
When applying str_sub() to a character vector (containing multiple strings) the function is applied element-wise in typical vectorized fashion including vector recycling as necessary. 

In [None]:
# Define some strings
str_single <- 'This is a string using single quotes'
str_double <- "This is a string using double quotes"
str_combined <- c(str_single, str_double)
print(str_combined)

# Return 'This' from str_combined
str_combined %>% ___(end = ___)

# Return 'quotes' from str_combined
str_combined %>% ___(start = ___)

# Return 'quotes' from str_combined
# Return a *single* comma separated string
# Hint: str_c with collapse
str_combined %>% 
   ___(start = ___) %>% 
   ___(collapse = ___)

# Return single string 'b, c, d' 
# using letters vector
# Hint: collapse with str_c, then take a substring with str_sub
letters %>% 
   ___(___) %>% 
   ___(___)

You can combine multiple string functions together which enables additional flexibility. This also provides many ways to accomplish the same task. Always look for the most closely matching function for your desired result.
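As one sketch of chaining two functions, the example below extracts a fixed-position substring from every element and then collapses the pieces. The `filenames` vector is hypothetical, and the approach assumes the year always sits at characters 8 to 11.

```r
library(stringr)

filenames <- c("report_2021.csv", "report_2022.csv", "report_2023.csv")

# Step 1: str_sub() is applied element-wise, so this returns three years
years <- str_sub(filenames, start = 8, end = 11)
print(years)

# Step 2: str_c() with collapse turns the three years into one string
years_line <- str_c(years, collapse = ", ")
print(years_line)    # "2021, 2022, 2023"
```

Fixed positions only work when every string has the same layout. When the layout varies, a pattern-based function such as str_extract() is usually a better match.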

## Upper case, lower case and title case
Convert case of a string.
### Usage
str_to_upper(string, locale = "")

str_to_lower(string, locale = "")

str_to_title(string, locale = "")
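A quick sketch of the three conversions on a made-up mixed-case string, `heading`, relying on the default locale.

```r
library(stringr)

heading <- "the QUICK brown fox"

lower_case <- str_to_lower(heading)
upper_case <- str_to_upper(heading)
# Title case capitalizes the first letter of each word and lowercases the rest
title_case <- str_to_title(heading)

print(lower_case)   # "the quick brown fox"
print(upper_case)   # "THE QUICK BROWN FOX"
print(title_case)   # "The Quick Brown Fox"
```

Converting to a single case before comparing strings is a common way to make matching case-insensitive.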

In [None]:
# View help on str_to_lower()
?___

In [None]:
# Define a string to work with
str_fox_dog <- "The crazy brown fox jumped over the lazy dog."

# Return the above string in
# lowercase
# uppercase
# First character of each word capitalized
___(str_fox_dog)
___(str_fox_dog)
___(str_fox_dog)

## str_trim()
Trim whitespace from start and end of string. 
### Usage
str_trim(string, side = c("both", "left", "right"))

In [None]:
# View help on str_trim()
?___

In [None]:
# Define strings
str_left_whitespace <- "   Whitespace    on left"
str_right_whitespace <- "Whitespace   on right   "
str_both_whitespace <- "   Whitespace    on both    "
str_newline <- "
line2
"
str_combined <- c(str_left_whitespace, str_right_whitespace, 
                  str_both_whitespace, str_newline)
print(str_combined)

# Trim whitespace from str_combined
# left only
# right only
# both
print("Trim left")
str_combined %>% ___("___")

print("Trim right")
str_combined %>% ___("___")

print("Trim both")
str_combined %>% ___("___")

# str_trim without specifying a side does what?
print("Trim <none>")
str_combined %>% ___()

* It trims all kinds of whitespace characters, including spaces and newline characters.
* It does not trim whitespace in the middle of the string.
* The default is to trim both the left and right sides.
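The three observations above can be confirmed with one small string. This is a sketch using a made-up value, `padded`, with two leading spaces, a double space in the middle, and a trailing space plus newline.

```r
library(stringr)

padded <- "  two  words \n"

trim_left  <- str_trim(padded, side = "left")    # "two  words \n"
trim_right <- str_trim(padded, side = "right")   # "  two  words"
trim_both  <- str_trim(padded)                   # "two  words"

# print() shows the remaining whitespace inside the quotes
print(trim_left)
print(trim_right)
print(trim_both)
```

The double space between the words survives every call, because str_trim() only touches the ends of the string.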

## Regular Expressions
Regular expressions are strings that can be passed to string functions to aid in matching and replacing characters. Regular expression syntax gives many symbols a special meaning for wildcard pattern matching.

To use one of these symbols as a text literal, you need to prefix it with two backslash characters to "escape" it. The backslash is needed twice because both R strings and regular expressions use the same character to mean escape. For the same reason, \s is typed as "\\s" inside an R string.

* `\s` matches whitespace
* `*` means zero or more occurrences of the previous character
* `.` means any character
* `^` means start of string
* `$` means end of string
* `[abc]` means one of these characters
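The escaping rule is easier to see in action. The sketch below uses str_detect(), which is introduced in the next section, on two made-up vectors, `versions` and `padded_words`.

```r
library(stringr)

versions <- c("v1.0", "v10", "v1x0")

# Unescaped: "." matches any single character, so "1x0" also matches
any_char <- str_detect(versions, "1.0")
print(any_char)      # TRUE FALSE TRUE

# Escaped: "\\." in R source is the regular expression \. (a literal dot)
literal_dot <- str_detect(versions, "1\\.0")
print(literal_dot)   # TRUE FALSE FALSE

# "\\s*" after "^" allows optional leading whitespace before the letter a
padded_words <- c(" a", "a", "b")
starts_with_a <- str_detect(padded_words, "^\\s*a")
print(starts_with_a) # TRUE TRUE FALSE
```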

## str_detect()
Detect the presence or absence of a pattern in a string. Vectorised over string and pattern. 
### Usage
str_detect(string, pattern)

In [None]:
# View help on str_detect()
?___

In [None]:
# Define strings
str_apple <- c(" apple pie", "apple", "Apple pie cake", 
               "banana apple pie", "blueberry pie", "apple apple", "apricot applesause cake")

# Return a TRUE/FALSE (logical) vector for strings containing 'apple'
# Assign to match_index
print("strings containing 'apple'")
___ <- ___(str_apple, "___")

# print match_index
print(___)

# Print strings associated with
# TRUE in match_index
# Hint: Use the index inside [] for the string variable
str_apple[___] %>% 
   print()

# Print strings containing 'pie'
print("strings containing 'pie'")
___ <- ___(str_apple, "___")
str_apple[___] %>% 
   print()

# Print strings ending in 'pie'
# Hint: use $
print("strings ending in 'pie'")
___ <- ___(str_apple, "___")
str_apple[___] %>% 
   print()

# Print strings starting with apple
# Hint: use ^ for starting
print("strings starts with 'apple'")
___ <- ___(str_apple, "___")
str_apple[___] %>% 
   print()

# Print strings starting with apple
# Ignore whitespace and match both 'apple' and 'Apple'
# Hint: use ^ for starting
# Hint: use \s for space and * for zero or more
# Hint: use [Aa] for upper and lower case A
print("strings starts with 'apple' enhanced")
___ <- ___(str_apple, "___")
str_apple[___] %>% 
   print()

str_detect() is useful for finding the rows of data that match a given pattern.
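A minimal sketch of the detect-then-index pattern, using a made-up vector `fruits` so that it does not give away the exercise above.

```r
library(stringr)

fruits <- c("cherry tart", "Cherry cola", "sour cherry", "peach")

# Matching is case sensitive, so "Cherry cola" is not detected
has_cherry <- str_detect(fruits, "cherry")
print(has_cherry)            # TRUE FALSE TRUE FALSE

# A logical vector inside [] keeps only the TRUE positions
cherry_items <- fruits[has_cherry]
print(cherry_items)

# ^ anchors to the start; [Cc] accepts either case for the first letter
starts_with_cherry <- str_detect(fruits, "^[Cc]herry")
print(starts_with_cherry)    # TRUE TRUE FALSE FALSE

# Because TRUE counts as 1, sum() gives the number of matches
match_count <- sum(has_cherry)
print(match_count)
```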

## str_extract()
Extract matching patterns from a string. Vectorised over string and pattern. 
### Usage
str_extract(string, pattern)

In [None]:
# View help on str_extract()
?___

In [None]:
# Define strings
str_apple <- c(" apple pie", "apple", "Apple pie cake", 
               "banana apple pie", "blueberry pie", "apple apple", "apricot applesause cake")

# Print strings starting with apple
# Hint: use ^ for starting
print("strings starts with 'apple'")
___(str_apple, "___")

# Print strings starting with apple
# Ignore whitespace and match both 'apple' and 'Apple'
# Hint: use ^ for starting
# Hint: use \s for space and * for zero or more
# Hint: use [Aa] for upper and lower case A
print("strings starts with 'apple' enhanced")
___(str_apple, "___")

# Find 'apple' or 'Apple' _middle_text_ then 'cake'
# Hint: use [Aa] and .*
print("Find 'apple' _middle_text_ then 'cake'")
___(str_apple, "___")

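str_extract() is useful for extracting the matching text. Where str_detect() returns TRUE or FALSE, str_extract() returns the matched text itself, or NA when there is no match.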
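The sketch below contrasts the two functions on a made-up vector, `orders`. The pattern `[0-9]+` uses `+` (one or more), which is standard regular expression syntax even though it is not in the short list given earlier.

```r
library(stringr)

orders <- c("order 66 shipped", "no number here", "items: 3 and 12")

# str_detect() only reports whether a match exists
has_number <- str_detect(orders, "[0-9]+")
print(has_number)     # TRUE FALSE TRUE

# str_extract() returns the first match from each string, or NA if none
first_number <- str_extract(orders, "[0-9]+")
print(first_number)   # "66" NA "3"
```

Only the first match per string is returned, so "12" in the last element is not included. stringr also provides str_extract_all() for cases where every match is needed.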

## str_replace()
Replace matched patterns in a string. Vectorised over string, pattern and replacement. 
### Usage
str_replace(string, pattern, replacement)

In [None]:
# View help on str_replace
?___

In [None]:
# Define strings
str_apple <- c(" apple pie", "apple", "Apple pie cake", 
               "banana apple pie", "blueberry pie", "apple apple", "apricot applesause cake")

# Replace apple with cherry for str_apple
___(str_apple, "___", "___")

# Work with mpg dataset model column
# Glimpse mpg
___(mpg)

# Make a copy of mpg as df
___ <- ___

# Convert df$model to a factor
# Hint: as.factor()
df <- df %>% mutate(___ = ___)

# Display model factor levels
# Hint: levels()
df$model %>% ___()

# Select only model column
# Provide distinct data only
# Sort alphabetically
# Display first 5
df %>% select(___) %>% 
   ___() %>%
   ___() %>%
   ___(5)

# Detect 2, 4 and all wheel drive in mpg$model directly
# Display unique values of model containing 2 or 4 or all wheel drive
# Sort alphabetically
# Display first 5
# Hint: use filter() and str_detect()
df %>% ___(___(model, "___")) %>%
   ___(model) %>% 
   ___() %>%
   ___() %>%
   ___(5)

# Update df$model to remove 2wd, 4wd, and awd
# Also remove any surrounding whitespace
# Hint: mutate(model), convert to char and back to factor as.character() and as.factor()
# Hint: str_replace() \s for whitespace, * for zero or more occurrences, [] for character groups
df <- df %>% ___(model = ___(model %>% ___, "___", "") %>% ___)

# Display model factor levels
# Hint: levels()
df$model %>% ___()

* Notice the removal of 2wd, 4wd, and awd from the model column.
* str_replace() returns a character vector, so the exercise converts the factor to character before the replacement and back to a factor afterwards.
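The factor round trip can be sketched on a tiny made-up factor, `sizes`, without the mpg data. The pattern here is only an illustration and is not the answer to the exercise above.

```r
library(stringr)

sizes <- factor(c("shirt small", "hat small", "shirt"))

# Convert to character, strip a trailing " small", convert back to a factor
cleaned <- as.factor(str_replace(as.character(sizes), "\\s*small$", ""))
cleaned_levels <- levels(cleaned)
print(cleaned_levels)    # "hat" "shirt"

# str_replace() changes only the first match in each string
first_only <- str_replace("a-b-c", "-", "+")
print(first_only)        # "a+b-c"
```

After the replacement, "shirt small" and "shirt" become the same value, so the new factor has fewer levels than the original. When every match in a string should be replaced, stringr provides str_replace_all().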

# Summary
The stringr package offers many string functions to help you wrangle data. You can combine them to handle most string manipulation tasks.