# Lab 7

This lab will cover a lot of review! We will touch on strings, regular expressions, factors, and dates.

## Table of Contents
* [Review](#Review)
* [Explore](#Explore)
* [Exercises](#Exercises)

In [210]:
library(tidyverse)
library(forcats)
library(lubridate)


Attaching package: ‘lubridate’

The following object is masked from ‘package:base’:

    date



## Review

### Strings

Strings are R's main data type for storing text (character) data, and hopefully they are fairly familiar to you by now! A few special characters within a string are as follows:
* \ : Escapes the character that follows it, so it is read as a special code (e.g. \n) or as a literal character (e.g. \' inside a single-quoted string)
* Quotes ('', "") : Doesn't make a difference which one you use, but there are a handful of cases where mixing them is useful (e.g. putting a double quote inside a single-quoted string)
* \n : Interpreted as a new line within a string
* \t : Interpreted as a tab within a string
* \u : Inserts the Unicode character whose hexadecimal code follows it (https://unicode-table.com/en/)

In [52]:
#Putting all of these together
sk8boi = 'He was a   \u2620 sk8er boi \u2620\nShe said,\"see you later boy\"\nHe wasn\'t good enough for her'

In [53]:
#The raw string
sk8boi

In [54]:
#How it looks once the escapes are interpreted (writeLines prints to the console by default)
writeLines(sk8boi)

He was a   ☠ sk8er boi ☠
She said,"see you later boy"
He wasn't good enough for her


Within tidyverse (stringr) there are a bunch of helpful functions for working with text. I mentioned them last lab, but here they all are again:

1. Character manipulation **(str_length, str_sub, str_dup, str_c)**
2. Handling whitespace **(str_pad, str_trunc, str_trim, str_wrap)**
3. Locale-sensitive operations, controlled with 'locale =' **(str_to_upper, str_to_lower, str_to_title, str_order, str_sort)**
4. Pattern matching normally using regex **(str_detect, str_subset, str_count, str_locate, str_locate_all, str_extract, str_extract_all, str_match, str_match_all, str_replace, str_replace_all, str_split, str_split_fixed)**

In [55]:
#Character manipulation
tmpstr = c("bAnana", "cAt", "bAr")
#Get the length of each string
str_length(tmpstr)

In [67]:
#Pull the second character from each string, or get the last character from each
str_sub(tmpstr,2,2)
str_sub(tmpstr,-1,-1)

In [59]:
#Duplicate each string twice
str_dup(tmpstr,2)

In [66]:
#Concatenate magic on each string, and combine them together
str_c(tmpstr,'+MAGIC')
str_c(tmpstr, collapse=",")

In [85]:
#Handling whitespace (Probably won't encounter much)
#Makes sure each element is at least 10 characters long
str_pad(tmpstr,10)
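
The other whitespace helpers from the list work in a similar way. Here is a minimal sketch; the `messy` string is made up for illustration and is not part of the lab data:

```r
library(stringr)

# Hypothetical example string with stray whitespace on both sides
messy <- "   too much space   "

# Remove whitespace from both ends, or from one side only
trimmed <- str_trim(messy)                 # "too much space"
left_only <- str_trim(messy, side = "left") # "too much space   "

# Cut a long string down to at most 10 characters (the "..." counts toward the width)
short <- str_trunc("the old man and the sea", 10)  # "the old..."

# str_pad pads on the left by default; side = "right" pads on the other end
padded <- str_pad("cat", 10, side = "right")
str_length(padded)                          # 10
```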

In [87]:
#Locale-sensitive case conversion
#Capitalizes the first letter of each word
title = 'the old man and the sea'
str_to_title(title)
#Make it all capital
str_to_upper(title)

In [92]:
#String interpolation
#This can be used to insert the values of variables into a string
str_interp('we are looking for the book called: ${title}',list(title=str_to_title(title)))

In [104]:
#This will find the @ within an email and give the start/end position of the REGEX pattern
email = 'lpuglisi@umich.edu'
str_locate(email,'@')

start,end
9,9


In [101]:
#We can use this to pull out that character
str_sub(email, str_locate(email,'@'))

In [105]:
#We can split the email into two parts at the @; the uniqname could then be used for other analysis
str_split(email,'@')
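
Note that str_split() returns a list, which can be awkward to index. One possible way around this, sketched below with a made-up `emails` vector, is the `simplify = TRUE` argument, which returns a character matrix instead:

```r
library(stringr)

# Hypothetical vector of addresses for illustration
emails <- c("lpuglisi@umich.edu", "john@umich.edu")

# simplify = TRUE gives one row per email and one column per piece
parts <- str_split(emails, "@", simplify = TRUE)

uniqname <- parts[, 1]   # "lpuglisi" "john"
domain   <- parts[, 2]   # "umich.edu" "umich.edu"
```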

### Regular Expressions (REGEX)

Regular expressions are a syntax for finding patterns in text. Essentially, they are a set of standardized codes and characters that enable you to search for exactly the type of pattern you want to find in text. Shown below are the key codes that can be used within an expression. Note that these are only fundamentals, but should give you enough tools to match almost any type of pattern you are looking for.

Quantifiers:
* \* : matches the preceding item zero or more times
* ? : matches the preceding item zero or one time
* \+ : matches the preceding item one or more times
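
To see how the three quantifiers differ, here is a small sketch on a made-up vector (the strings are only for illustration). Each pattern asks for a 'c', some number of 'a's, and then a 't':

```r
library(stringr)

# Strings with zero, one, two, and three a's between c and t
x <- c("ct", "cat", "caat", "caaat")

star  <- str_detect(x, "ca*t")  # zero or more a's: TRUE  TRUE  TRUE  TRUE
quest <- str_detect(x, "ca?t")  # zero or one a:    TRUE  TRUE  FALSE FALSE
plus  <- str_detect(x, "ca+t")  # one or more a's:  FALSE TRUE  TRUE  TRUE
```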

Special characters:


* . : The dot matches any single character
* \n : Matches a newline character
* \t : Matches a tab (ASCII 9)
* \d : Matches a digit [0-9]
* \D : Matches a non-digit
* \w : Matches a word character (a letter, digit, or underscore)
* \W : Matches a non-word character
* \s : Matches a whitespace character
* \S : Matches a non-whitespace character
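* \ : Use \ to escape special characters. For example, \\. matches a dot, and \\\ matches a backslash. Inside an R string each backslash must itself be doubled, so in code these patterns are written `"\\."` and `"\\\\"` (and likewise `"\\d"`, `"\\w"`, `"\\s"`)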
* ^ : Match at the beginning of the input string
* $ : Match at the end of the input string.
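
A short sketch of a few of these codes in action. The file-name-like strings are made up for illustration, and note the doubled backslashes that R strings require:

```r
library(stringr)

# Hypothetical strings for illustration
x <- c("lab7.R", "lab 7", "7labs", "lab_seven")

has_digit    <- str_detect(x, "\\d")     # contains a digit anywhere
starts_digit <- str_detect(x, "^\\d")    # a digit at the very start
has_space    <- str_detect(x, "\\s")     # contains whitespace
ends_dot_R   <- str_detect(x, "\\.R$")   # ends with a literal ".R"
```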

Character Classes:

* [abc] :	Match any of a, b, and c
* [a-z]	: Match any character between a and z
* [^abc] :	A caret ^ at the beginning indicates "not". In this case, match anything other than a, b, or c.
* [+\*?.] : 	Most special characters have no meaning inside the square brackets. This expression matches any of +, *, ? or the dot
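
Character classes can be tried out the same way. This is only a sketch on made-up strings:

```r
library(stringr)

# Hypothetical strings for illustration
x <- c("cat", "cot", "cut", "c.t", "c+t")

a_or_o     <- str_detect(x, "c[ao]t")   # middle character is a or o
not_a_or_o <- str_detect(x, "c[^ao]t")  # middle character is anything else
dot_plus   <- str_detect(x, "c[.+]t")   # . and + are literal inside []

# A range: the first run of lowercase letters in the string
lower_run <- str_extract("Room 42b", "[a-z]+")  # "oom"
```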

Grouping:

Grouping is done with parentheses; the OR (|) operator can be used inside a group.
* (ab)+ : Matches 'ab' repeated one or more times in a row
* (aa|bb)+ : Matches one or more consecutive repetitions of 'aa' or 'bb'
* a(\d+)a : Matches 'a', then one or more digits (captured as a group), then another 'a'
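
The three grouping patterns above can be sketched like this. The input strings are made up for illustration; str_match() is used for the last one because it returns the captured group as well as the full match:

```r
library(stringr)

# (ab)+ : the longest run of repeated 'ab'
run_ab <- str_extract("xababab!", "(ab)+")        # "ababab"

# (aa|bb)+ : a run made of 'aa' and 'bb' pieces
run_mix <- str_extract("ccaabbaadd", "(aa|bb)+")  # "aabbaa"

# a(\d+)a : column 1 is the full match, column 2 is the captured digits
m <- str_match("xa123ay", "a(\\d+)a")
m[1, 1]   # "a123a"
m[1, 2]   # "123"
```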

In [118]:
#First let's try to get everyone's uniqname
emails = c('lpuglisi@umich.edu','john@umich.edu','larry@umich.edu')
#Pattern matches as many word characters (letters, digits, underscore) as it can, reading left to right (i.e. it stops at the @)
re = "(\\w)+"
str_extract(emails,re)

In [125]:
#Now, let's say we want to pull the names associated with group 01
animalnames = c('01_Noodle','01_Momo','01_Meatball','02_Turd',
                '02_Lardball','02_Spudmuffin','03_ChairmanMeow','03_FelineCastro')

#Fairly simple with str_detect! Anchor the pattern to the start so a '1' elsewhere in a name cannot match
re = "^01_"
animalnames[str_detect(animalnames,re)]

In [149]:
#And if we just want to get the names?
#Inside [], parentheses and | are literal characters, so a plain letter class is all we need
re = "[A-Za-z]+"
str_extract(animalnames,re)

### Factors

Factors are a common data type within R that are used to store categorical variables, or variables that only have a specific set of levels within them (e.g. gender, school grade, Qdoba rice type, etc.). They are very useful (and often required) for many types of algorithms and functions that try to make predictions. The forcats package in R has some useful functions for working with factors:
* fct_reorder: Reorder the levels of a factor by sorting along another variable
* fct_reorder2: Reorder the levels by the y values associated with the largest x values (useful for lining up a legend with the lines in a plot)
* fct_relevel: Manually move levels of the factor to a new position
* fct_infreq: Order factor levels by their frequency (most frequent first)
* fct_rev: Reverse the order of the factor levels
* fct_recode: Manually rename the levels of a factor
* fct_collapse: Collapse factor levels into manually defined groups

In [194]:
#Make some sample data to play around with
products = c('sponge','brush','brillo','duck','duck')
price = c(1.99,4.99,3.99,23.89,23.89)
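#stringsAsFactors = TRUE is needed in R 4.0+ for products to be stored as a factor
df = data.frame(products,price,stringsAsFactors = TRUE)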
str(df)

'data.frame':	5 obs. of  2 variables:
 $ products: Factor w/ 4 levels "brillo","brush",..: 4 2 1 3 3
 $ price   : num  1.99 4.99 3.99 23.89 23.89


In [195]:
#Reorder the products based on the price
df$products2 = fct_reorder(df$products,df$price)
df

products,price,products2
sponge,1.99,sponge
brush,4.99,brush
brillo,3.99,brillo
duck,23.89,duck
duck,23.89,duck


In [196]:
#Notice that the data looks the same but the levels are set based on the price (ggplot would plot in these orders)
levels(df$products)
levels(df$products2)

In [197]:
#Move duck to the second position
df$products2 = fct_relevel(df$products2, "duck", after=1)
levels(df$products)
levels(df$products2)

In [198]:
#Maybe instead of 'duck' we actually meant 'soap'
df$products2 = fct_recode(df$products2, soap = "duck")
levels(df$products)
levels(df$products2)

In [199]:
#Collapse lets us make a new category that stores multiple levels of a given factor
#Notice the current levels of the factor
levels(gss_cat$partyid)
#Now we can group them and count them
gss_cat %>%
  mutate(partyid = fct_collapse(partyid,
    Other = c("No answer", "Don't know", "Other party"),
    Republican = c("Strong republican", "Not str republican"),
    Independent = c("Ind,near rep", "Independent", "Ind,near dem"),
    Democrat = c("Not str democrat", "Strong democrat")
  )) %>%
  count(partyid)

partyid,n
Other,548
Republican,5346
Independent,8409
Democrat,7180


In [209]:
#Now we may want to order a factor in terms of how many times it shows up in a data set
df
print("Now reorder based on frequency: ")
levels(df$products2)
levels(fct_infreq(df$products2))

products,price,products2
sponge,1.99,sponge
brush,4.99,brush
brillo,3.99,brillo
duck,23.89,soap
duck,23.89,soap


[1] "Now reorder based on frequency: "
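
fct_rev from the list above is not used in the cells here, so below is a minimal sketch of it. The `sizes` factor is made up for illustration:

```r
library(forcats)

# Hypothetical factor with an explicit level order
sizes <- factor(c("small", "large", "medium", "small"),
                levels = c("small", "medium", "large"))

levels(sizes)            # "small"  "medium" "large"

# fct_rev flips the level order but leaves the data values untouched
reversed <- fct_rev(sizes)
levels(reversed)         # "large"  "medium" "small"
as.character(reversed)   # "small"  "large"  "medium" "small"
```

This is handy when a plot draws the levels in the opposite order from the one you want.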


### Dates

A past lab already explained a lot of the basics of dates in R, so here I'll summarize the content that has been covered so far. Most of the work you do with dates can be done with lubridate. Within R, date/time data comes in three main types: dates, times, and datetimes. Shown below are some of the most useful date functions.

Current dates:
* today() : today's date
* now() : the current date and time

Converting strings to dates (use locale to set different languages):
* ymd("2017-01-31") : year/month/day
* mdy("January 31st, 2017") : month/day/year
* dmy("31-01-2017"): day/month/year
* ymd_hms("2017-01-31 20:11:59"): year/month/day hour/minute/second
* mdy_hm("01/31/2017 08:01"): month/day/year hour/minute
* make_date(): Requires a year, month, and day to output a date
* make_datetime(): Equivalent to make_date for datetimes
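
As a quick sketch, all of the calls below describe the same day, so they should parse to the same date. The specific date is just the one used in the list above, and the month-name example assumes an English locale:

```r
library(lubridate)

from_ymd  <- ymd("2017-01-31")
from_mdy  <- mdy("January 31st, 2017")   # month name assumes an English locale
from_dmy  <- dmy("31-01-2017")
from_make <- make_date(2017, 1, 31)

# Datetimes: parsed in UTC unless a tz argument is supplied
dt_parsed <- ymd_hms("2017-01-31 20:11:59")
dt_made   <- make_datetime(2017, 1, 31, 20, 11, 59)
dt_hm     <- mdy_hm("01/31/2017 08:01")
```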

Datetime components:
* year()
* month()
* mday(): day of the month
* yday(): day of the year
* wday(): day of the week
* hour()
* minute()
* second()
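
Here is a small sketch of the component functions applied to one datetime. The datetime is arbitrary; note that wday() counts Sunday as day 1 by default:

```r
library(lubridate)

dt <- ymd_hms("2017-01-31 20:11:59")

year(dt)     # 2017
month(dt)    # 1
mday(dt)     # 31
yday(dt)     # 31 (January 31st is the 31st day of the year)
wday(dt)     # 3  (a Tuesday, with Sunday = 1 by default)
hour(dt)     # 20
minute(dt)   # 11
second(dt)   # 59

# label = TRUE returns the day name as an ordered factor instead of a number
wday(dt, label = TRUE)
```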

Datetime durations (these can be added to or subtracted from dates):
* dyears(): Specify number of years
* dweeks()
* ddays()
* dhours()
* dminutes()
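
Durations are stored as an exact number of seconds, which is why they can be used in arithmetic. A minimal sketch with an arbitrary starting date:

```r
library(lubridate)

d <- ymd("2017-01-31")

next_day  <- d + ddays(1)     # "2017-02-01"
two_weeks <- d + dweeks(2)    # "2017-02-14"
month_ago <- d - ddays(31)    # "2016-12-31"

# Durations can be combined before being added to a datetime
later <- ymd_hms("2017-01-31 20:11:59") + dhours(4) + dminutes(30)
later                         # "2017-02-01 00:41:59 UTC"
```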

Timezones: Every datetime is stored with a time zone. UTC is the default, but this can be changed with the tz argument.
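
As a sketch of what changing the time zone means, lubridate's with_tz() and force_tz() do two different things. The datetime and the "America/New_York" zone are just picked for illustration:

```r
library(lubridate)

# Parsed in UTC by default
utc <- ymd_hms("2017-01-31 20:11:59")
tz(utc)                                   # "UTC"

# Same clock time, but read as New York time from the start
ny <- ymd_hms("2017-01-31 20:11:59", tz = "America/New_York")

# with_tz: same instant in time, shown on a different clock (15:11:59 in New York)
shown_in_ny <- with_tz(utc, "America/New_York")

# force_tz: same clock time, but a different instant
forced_ny <- force_tz(utc, "America/New_York")
```

Use with_tz() when only the display should change, and force_tz() when the data was labelled with the wrong time zone.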


## Explore

This week will focus on the review and exercises!

## Exercises

### Section 14

In [None]:
#In your own words, describe the difference between the sep and collapse arguments to str_c().

In [None]:
#Use str_length() and str_sub() to extract the middle character from a string. 
#What will you do if the string has an even number of characters?

In [None]:
# What does str_wrap() do? When might you want to use it?

In [None]:
#Write a function that turns (e.g.) a vector c("a", "b", "c") into the string a, b, and c. 
#Think carefully about what it should do if given a vector of length 0, 1, or 2.

In [None]:
#How would you match the sequence "'\ ?

In [None]:
# How would you match the literal string "$^$"?

In [None]:
# Given the corpus of common words in stringr::words, create regular expressions that find all words that:
#Start with “y”.
#End with “x”
#Are exactly three letters long. (Don’t cheat by using str_length()!)
#Have seven letters or more.

In [None]:
# Create regular expressions to find all words that:
#Start with a vowel.
#That only contain consonants. (Hint: thinking about matching “not”-vowels.)
#End with ed, but not with eed.
#End with ing or ise.

In [None]:
# Empirically verify the rule "i before e except after c".

### Section 15

In [None]:
#Which relig does denom (denomination) apply to? How can you find out with a table? 
#How can you find out with a visualization?

In [None]:
# For each factor in gss_cat identify whether the order of the levels is arbitrary or principled.

In [None]:
# How could you collapse rincome into a small set of categories?

### Section 16

In [None]:
# What happens if you parse a string that contains invalid dates?

In [None]:
# Use the appropriate lubridate function to parse each of the following dates:
d1 <- "January 1, 2010"
d2 <- "2015-Mar-07"
d3 <- "06-Jun-2017"
d4 <- c("August 19 (2015)", "July 1 (2015)")
d5 <- "12/30/14"

In [None]:
# How does the distribution of flight times within a day change over the course of the year?

In [None]:
# How does the average delay time change over the course of a day? Should you use dep_time or sched_dep_time? Why?

In [None]:
#Confirm my hypothesis that the early departures of flights in minutes 20-30 and 50-60 are caused by 
#scheduled flights that leave early. 
#Hint: create a binary variable that tells you whether or not a flight was delayed.

In [None]:
# Why is there months() but no dmonths()?

In [None]:
# Create a vector of dates giving the first day of every month in 2015. 
# Create a vector of dates giving the first day of every month in the current year.