## Case studies

Practice your string manipulation skills on a couple of case studies. You'll also learn a few new skills: reading strings into R and handling problems of case (e.g. A versus a).

### Getting the play into R
We've already downloaded the play and put the text file in your workspace. Your first step is to read the play into R using stri_read_lines().

You should take a look at the original text file: importance-of-being-earnest.txt

You'll see there is some foreword and afterword text that Project Gutenberg has added. You'll want to remove that, and then split the play into the introduction (the list of characters, scenes, etc.) and the main body.

In [9]:
library(stringi)
library(stringr)
# Read play in using stri_read_lines()
file = "C:/Users/Migue/datacamp R/String Manipulation with stringr in R/Importance of being earnest.txt"
earnest <- stri_read_lines(file)

# Find the lines that end the foreword and start of afterword by detecting the patterns 
# "START OF THE PROJECT" and "END OF THE PROJECT".

start <- str_which(earnest, fixed("START OF THE PROJECT"))
end <- str_which(earnest, fixed("END OF THE PROJECT"))

# Use the start and end positions to subset the play to the lines between (start + 1) and (end - 1).

earnest_sub  <- earnest[(start + 1):(end - 1)]

# Detect first act
lines_start <- str_which(earnest_sub, fixed("FIRST ACT"))

# Set up index
intro_line_index <- 1:(lines_start - 1)

# Split play into intro and play
intro_text <- earnest_sub[intro_line_index]
play_text <- earnest_sub[-intro_line_index]

# Take a look at the first 20 lines
writeLines(head(play_text, 20))
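
The marker-based trimming and splitting used above can be tried on a small in-memory vector before running it on the real file. This is a minimal sketch: the `toy` vector and its contents are made up for illustration and only mimic the layout of the Project Gutenberg file.

```r
library(stringr)

# Hypothetical stand-in for the file: marker lines wrapped around the content
toy <- c("licence header",
         "*** START OF THE PROJECT ***",
         "TITLE",
         "",
         "FIRST ACT",
         "Jack.  Hello.",
         "*** END OF THE PROJECT ***",
         "licence footer")

# Locate the two marker lines and keep only what lies strictly between them
start <- str_which(toy, fixed("START OF THE PROJECT"))
end <- str_which(toy, fixed("END OF THE PROJECT"))
body <- toy[(start + 1):(end - 1)]

# Everything before "FIRST ACT" is the intro, the rest is the play
first_act <- str_which(body, fixed("FIRST ACT"))
intro_index <- 1:(first_act - 1)
intro <- body[intro_index]
play <- body[-intro_index]

intro  # "TITLE" ""
play   # "FIRST ACT" "Jack.  Hello."
```

Note that `1:(first_act - 1)` assumes at least one line precedes the "FIRST ACT" marker, which holds for the play but not for every input.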

### Identifying the lines, take 1
The first thing you might notice when you look at your vector play_text is that there are lots of empty lines. They don't really affect your task, so you might want to remove them. The easiest way to find empty strings is to use the stringi function stri_isempty(), which returns a logical vector you can use to subset the non-empty strings:

# Get rid of empty strings

empty <- stri_isempty(play_text)

play_lines <- play_text[!empty]
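
As a quick check of what stri_isempty() treats as empty, here is a small sketch on a made-up vector (the strings in `x` are illustrative, not taken from the play). Only zero-length strings count as empty; a string holding a single space does not.

```r
library(stringi)

# Hypothetical input: one truly empty string and one whitespace-only string
x <- c("Algernon.  Hello.", "", "Lane.  Yes, sir.", " ")

empty <- stri_isempty(x)
empty      # FALSE TRUE FALSE FALSE

# Negate the logical vector to keep the non-empty strings
x[!empty]  # the " " element is kept because its length is 1
```

If whitespace-only lines should also be dropped, one option is to trim first, for example with `stri_trim_both()`, before testing for emptiness.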

So, how are you going to find the elements that indicate a character starts their line? Consider the following lines:

play_lines[10:15]

[1] "Algernon.  I'm sorry for that, for your sake.  I don't play"             
[2] "accurately--any one can play accurately--but I play with wonderful"      
[3] "expression.  As far as the piano is concerned, sentiment is my forte.  I"
[4] "keep science for Life."                                                  
[5] "Lane.  Yes, sir."                                                        
[6] "Algernon.  And, speaking of the science of Life, have you got the"

The first line is for Algernon, the next three strings are continuations of that line, then line 5 is for Lane and line 6 for Algernon.

How about looking for lines that start with a word followed by a period (.)?

play_lines, containing the lines of the play as a character vector, has been pre-defined.

In [11]:
library(rebus)
# Build a pattern that matches the start of the line, followed by one or more word characters, then a period.
# Pattern for start, word then .
pattern_1 <- START %R% one_or_more(WRD) %R% DOT

# Use your pattern with str_view() to see the lines that matched, and those that didn't match. 
str_view(play_text, pattern = pattern_1, match = TRUE) 
str_view(play_text, pattern = pattern_1, match = FALSE)

"package 'rebus' was built under R version 3.6.3"
Attaching package: 'rebus'

The following object is masked from 'package:stringr':

    regex



In [13]:
# Try a more specific pattern: the start of the line, a capital letter, followed by one or more word characters, 
# then a full stop.

pattern_2 <-  START %R% ascii_upper() %R% one_or_more(WRD) %R% DOT

# As before, view the matched lines,
str_view(play_text, pattern = pattern_2, match = TRUE)
str_view(play_text, pattern = pattern_2, match = FALSE)

In [15]:
# Get subset of lines that match
lines <- str_subset(play_text, pattern = pattern_2)

# Extract match from lines
who <- str_extract(lines, pattern = pattern_2)

# Let's see what we have
unique(who)

### Identifying the lines, take 2
The pattern "starts with a capital letter, has some other characters then a full stop" wasn't specific enough. You ended up matching lines that started with things like University., July., London., and you missed characters like Lady Bracknell and Miss Prism.
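
The shortcoming can be reproduced on a handful of lines. In this sketch the `toy_lines` vector is invented for illustration; the pattern is the same one built in the exercise above.

```r
library(stringr)
library(rebus)

# Hypothetical lines: two real speakers, one false positive, one continuation
toy_lines <- c("Algernon.  I'm sorry for that.",
               "University.  Term begins soon.",
               "Lady Bracknell.  Good afternoon.",
               "keep science for Life.")

# Start of line, a capital letter, one or more word characters, then a period
pattern_2 <- START %R% ascii_upper() %R% one_or_more(WRD) %R% DOT

who <- str_extract(toy_lines, pattern_2)
who  # "Algernon." "University." NA NA
```

"University." is matched even though it is not a character, while "Lady Bracknell." is missed because the space between the two words is not a word character.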

Let's take a different approach. You know the characters' names from the play's introduction, so try looking specifically for lines that start with their names. You'll find the or1() function from the rebus package helpful. It specifies alternatives, but rather than taking each alternative as a separate argument as or() does, it takes a single vector of alternatives.

We've created the characters vector with all the characters' names.
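
To see how or1() behaves before applying it to the whole play, here is a small sketch with a shortened, made-up name vector and invented lines (`some_characters` and `toy_lines` are illustrative only).

```r
library(stringr)
library(rebus)

# Hypothetical subset of names, including one with a space in it
some_characters <- c("Jack", "Lady Bracknell")

# Start of line, any one of the names, then a period
name_pattern <- START %R% or1(some_characters) %R% DOT

toy_lines <- c("Jack.  Hello.",
               "Lady Bracknell.  Good afternoon.",
               "University.  Term begins soon.",
               "Jacket.  Not a speaker.")

who <- str_extract(toy_lines, name_pattern)
who         # "Jack." "Lady Bracknell." NA NA

# Count lines per matched name, ignoring the non-matches
table(who)
```

Because the period must follow the name directly, "Jacket." is not mistaken for "Jack.".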

In [16]:
# Create vector of characters
characters <- c("Algernon", "Jack", "Lane", "Cecily", "Gwendolen", "Chasuble", 
  "Merriman", "Lady Bracknell", "Miss Prism")

# Match start, then character name, then .
pattern_3 <- START %R% or1(characters) %R% DOT

# View matches of pattern_3
str_view(play_text, pattern = pattern_3, match = TRUE)
  
# View non-matches of pattern_3
str_view(play_text, pattern = pattern_3, match = FALSE)


In [18]:
# Pull out matches
lines <-  str_subset(play_text, pattern = pattern_3)

# Extract match from lines
who <- str_extract(lines, pattern = pattern_3)

# Let's see what we have
unique(who)

# Count lines per character
table(who)

who
      Algernon.         Cecily.       Chasuble.      Gwendolen.           Jack. 
            201             154              42             102             219 
Lady Bracknell.           Lane.       Merriman.     Miss Prism. 
             84              21              17              41 

### Changing case to ease matching
A simple solution to working with strings in mixed case is to transform them all to lower case or all to upper case. You can then specify your pattern in that same case.

For example, while looking for "cat" finds no matches in the following string,

x <- c("Cat", "CAT", "cAt") 
str_view(x, "cat")

transforming the string to lower case first ensures all variations match.

str_view(str_to_lower(x), "cat")

See if you can find the catcidents that also involved dogs. You'll see a new rebus function called whole_word(). The argument to whole_word() will only match if it occurs as a word on its own, for example whole_word("cat") will match cat in "The cat " and "cat." but not in "caterpillar".

A character vector of cat-related accidents has been pre-defined in your workspace as catcidents.
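
The combination of whole_word() and case transformation can be sketched on a few made-up strings (the vector `x` below is illustrative and is not the catcidents data).

```r
library(stringr)
library(rebus)

# Hypothetical reports in mixed case
x <- c("walking DOG tripped over cat",
       "ate a hotDOG",
       "Dog chased cat",
       "DOGS barking")

whole_dog <- whole_word("DOG")

# On the original strings only an upper-case, stand-alone DOG matches
str_detect(x, whole_dog)  # TRUE FALSE FALSE FALSE

# Detect on the upper-cased copy, then subset the originals
has_dog <- str_detect(str_to_upper(x), whole_dog)
x[has_dog]  # "walking DOG tripped over cat" "Dog chased cat"
```

"hotDOG" and "DOGS" never match because DOG is not a word on its own there, regardless of case.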

In [20]:
# load catcidents
catcidents <- readRDS(file = "catcidents.rds")

# see catcidents
head(catcidents)

# Construct pattern of DOG in boundaries
whole_dog_pattern <- whole_word("DOG")

# See matches to word DOG
str_view(catcidents, pattern = whole_dog_pattern, match = TRUE)

In [21]:
# Transform catcidents to upper case
catcidents_upper <- str_to_upper(catcidents)

# View matches to word "DOG" again
str_view(catcidents_upper, pattern = whole_dog_pattern, match = TRUE)

In [22]:
# If you need to retain the original mixed case strings, you can use str_detect() on the transformed strings 
# to subset the original strings.

# Try it by creating has_dog from calling str_detect() on catcidents_upper with the upper case pattern.

# Which strings match?
has_dog <- str_detect(catcidents_upper, pattern = whole_dog_pattern)

# Pull out matching strings in original. Use has_dog and square brackets to subset catcidents.
catcidents[has_dog]

### Ignoring case when matching
Rather than transforming the input strings, another approach is to specify that the matching should be case insensitive. This is one of the options to the stringr regex() function.

Take our previous example,

x <- c("Cat", "CAT", "cAt") 

str_view(x, "cat")

To match the pattern cat in a case insensitive way, we wrap our pattern in regex() and specify the argument ignore_case = TRUE,

str_view(x, regex("cat", ignore_case = TRUE))

Notice that the matches retain their original case and any variant of cat matches.
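
The same behaviour can be checked without the viewer by extracting the matches. This sketch calls `stringr::regex()` with the package prefix on purpose: rebus also exports a function named regex(), and the attach message shown earlier reports that it masks stringr's version once rebus is loaded.

```r
library(stringr)

x <- c("Cat", "CAT", "cAt")

# A plain lower-case pattern matches none of the mixed-case strings
str_detect(x, "cat")  # FALSE FALSE FALSE

# A case-insensitive pattern matches all of them
cat_ci <- stringr::regex("cat", ignore_case = TRUE)
matches <- str_extract(x, cat_ci)
matches  # "Cat" "CAT" "cAt"
```

The extracted matches keep the case of the input, which is the main difference from transforming the strings first.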

Try it out to find the catcidents that involved tripping.

In [44]:
# First view the matches of catcidents to the pattern "TRIP". Notice how you only match those that are TRIP all in upper case.
str_view(catcidents, pattern = "TRIP", match = TRUE)

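# Construct a case-insensitive regex to "TRIP" by calling regex() with ignore_case = TRUE. Assign the result to trip_pattern.
# rebus masks stringr's regex() (see the attach message above), so call stringr::regex() explicitly.
trip_pattern <- stringr::regex("TRIP", ignore_case = TRUE)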

# Repeat your viewing of catcident trips, this time using the case insensitive trip_pattern. You should get a few more hits.
str_view(catcidents, pattern = trip_pattern, match = TRUE )

In [45]:
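# Get subset of matches. Call stringr::regex() explicitly: rebus was attached after stringr and
# masks regex(), which is the likely reason ignore_case = TRUE appeared not to work here.
trip <- str_subset(catcidents, pattern = stringr::regex("TRIP", ignore_case = TRUE))

# Extract matches
str_extract(trip, pattern = stringr::regex("TRIP", ignore_case = TRUE))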

### Fixing case problems
Finally, you might want to transform strings to a common case. You've seen you can use str_to_upper() and str_to_lower(), but there is also str_to_title() which transforms to title case, in which every word starts with a capital letter.

This is another situation where stringi functions offer slightly more functionality than the stringr functions. The stringi function stri_trans_totitle() allows a specification of the type which, by default, is "word", resulting in title case, but can also be "sentence" to give sentence case: only the first word in each sentence is capitalized.
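
The difference between the two types is easiest to see on one short string. This is a minimal sketch; the input `x` is made up, and it contains a single sentence to keep the result unambiguous.

```r
library(stringr)
library(stringi)

# Hypothetical mixed-case input
x <- "tRIPPED ovER cAT at HOME"

# Title case: every word starts with a capital letter
title_case <- str_to_title(x)
title_case     # "Tripped Over Cat At Home"

# Sentence case: only the first word is capitalized
sentence_case <- stri_trans_totitle(x, type = "sentence")
sentence_case  # "Tripped over cat at home"
```

With several sentences in one string, where the breaks fall is decided by the sentence boundary rules stringi uses, so it is worth inspecting the output on real data.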

Try outputting the catcidents in a consistent case.


In [46]:
# Store the first five elements of catcidents as cat5.
cat5 <- catcidents[1:5]

# Use writeLines() to display cat5.
writeLines(cat5)

# Repeat but now pass cat5 transformed to title case with str_to_title().
writeLines(str_to_title(cat5))

# Try again using stri_trans_totitle() instead. This should be identical to str_to_title().
writeLines(stri_trans_totitle(cat5))

# Finally, display cat5 transformed to sentence case, by passing 
# the type argument to stri_trans_totitle().

writeLines(stri_trans_totitle(cat5, 
  type = "sentence"))

79yOf Fractured fingeR tRiPPED ovER cAT ANd fell to FlOOr lAst nIGHT AT HOME*
21 YOF REPORTS SUS LACERATION OF HER LEFT HAND WHEN SHE WAS OPENING A CAN OF CAT FOOD JUST PTA. DX HAND LACERATION%
87YOF TRIPPED OVER CAT, HIT LEG ON STEP. DX LOWER LEG CONTUSION 
bLUNT CHest trAUma, R/o RIb fX, R/O CartiLAgE InJ To RIB cAge; 32YOM walKiNG DOG, dog took OfF aFtER cAt,FelL,stRucK CHest oN STepS,hiT rIbS
42YOF TO ER FOR BACK PAIN AFTER PUTTING DOWN SOME CAT LITTER DX: BACK PAIN, SCIATICA
79yof Fractured Finger Tripped Over Cat And Fell To Floor Last Night At Home*
21 Yof Reports Sus Laceration Of Her Left Hand When She Was Opening A Can Of Cat Food Just Pta. Dx Hand Laceration%
87yof Tripped Over Cat, Hit Leg On Step. Dx Lower Leg Contusion 
Blunt Chest Trauma, R/O Rib Fx, R/O Cartilage Inj To Rib Cage; 32yom Walking Dog, Dog Took Off After Cat,Fell,Struck Chest On Steps,Hit Ribs
42yof To Er For Back Pain After Putting Down Some Cat Litter Dx: Back Pain, Sciatica
79yof Fractured Finger Tripped