## Utilities

Mastering R programming is not only about understanding its programming concepts. Having a solid understanding of a wide range of R functions is also important. This chapter introduces you to many useful functions for data structure manipulation, regular expressions, and working with times and dates.

### Mathematical utilities
Have another look at some useful math functions that R features:

1. abs(): Calculate the absolute value.
2. sum(): Calculate the sum of all the values in a data structure.
3. mean(): Calculate the arithmetic mean.
4. round(): Round the values to 0 decimal places by default. Try out ?round in the console for variations of round() and ways to change the number of digits to round to.
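
As a quick illustration of the functions listed above, the sketch below applies them to a made-up vector `x` (an assumption for illustration, not part of the exercise data). It also shows the `digits` argument of `round()` and two of the related functions documented under `?round`.

```r
# Illustrative values only; x is not part of the exercise data
x <- c(-2.567, 3.14159, 10.8)

abs_x <- abs(x)                      # absolute values: 2.567 3.14159 10.8
total <- sum(x)                      # sum of all elements
avg <- mean(x)                       # arithmetic mean, i.e. total / length(x)

rounded_default <- round(x)          # 0 decimal places: -3 3 11
rounded_two <- round(x, digits = 2)  # 2 decimal places: -2.57 3.14 10.80

# Related functions documented on the same help page as round()
ceiling(x)                           # round up:   -2 4 11
floor(x)                             # round down: -3 3 10
```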

As a data scientist in training, you've estimated a regression model on the sales data for the past six months. After evaluating your model, you see that the training error of your model is quite regular, showing both positive and negative values.

In [2]:
# The errors vector has already been defined for you
errors <- c(1.9, -2.6, 4.0, -9.5, -3.4, 7.3)

# Sum of absolute rounded values of errors
sum(abs(round(errors)))

### Data Utilities
R features a number of functions for manipulating data structures:

1. seq(): Generate sequences, by specifying the from, to, and by arguments.
2. rep(): Replicate elements of vectors and lists.
3. sort(): Sort a vector in ascending order. Works on numerics, but also on character strings and logicals.
4. rev(): Reverse the elements in a data structure for which reversal is defined.
5. str(): Display the structure of any R object.
6. append(): Merge vectors or lists.
7. is.*(): Check for the class of an R object.
8. as.*(): Convert an R object from one class to another.
9. unlist(): Flatten (possibly embedded) lists to produce a vector.
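
The functions above can be sketched on a few small, made-up inputs. All variable names and values here are assumptions chosen for illustration.

```r
# Generate sequences and repetitions
s <- seq(from = 1, to = 10, by = 3)          # 1 4 7 10
r_times <- rep(c("a", "b"), times = 2)       # "a" "b" "a" "b"
r_each <- rep(c("a", "b"), each = 2)         # "a" "a" "b" "b"

# Sort and reverse
sorted <- sort(c(3, 1, 2))                   # 1 2 3
reversed <- rev(sorted)                      # 3 2 1

# Merge two vectors
combined <- append(s, reversed)              # 1 4 7 10 3 2 1

# Check and convert classes
is.numeric(combined)                         # TRUE
as_text <- as.character(combined)            # "1" "4" "7" "10" "3" "2" "1"

# Flatten a nested list into a vector and inspect its structure
flat <- unlist(list(1, list(2, 3)))          # 1 2 3
str(flat)
```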

Remember the social media profile views data? Your LinkedIn and Facebook view counts for the last seven days are defined as lists in the code below.

In [3]:
# The linkedin and facebook lists have already been created for you
linkedin <- list(16, 9, 13, 5, 2, 17, 14)
facebook <- list(17, 7, 5, 16, 8, 13, 14)

# Convert linkedin and facebook to a vector: li_vec and fb_vec
(li_vec <- unlist(linkedin))
(fb_vec <- unlist(facebook))

# Append fb_vec to li_vec: social_vec
(social_vec <- append(li_vec, fb_vec))

# Sort social_vec from high to low
sort(social_vec, decreasing = TRUE)

### Beat Gauss using R
There is a popular story about young Gauss. As a pupil, he had a lazy teacher who wanted to keep the classroom busy by having them add up the numbers 1 to 100. Gauss came up with an answer almost instantaneously, 5050. On the spot, he had developed a formula for calculating the sum of an arithmetic series. There are more general formulas for calculating the sum of an arithmetic series with different starting values and increments. Instead of deriving such a formula, why not use R to calculate the sum of a sequence?
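
For comparison, the closed-form result can be checked against a direct sum. The sketch below assumes the standard arithmetic-series formula (number of terms times the average of the first and last term); the variable names are illustrative.

```r
# Gauss's case: 1 + 2 + ... + 100
n <- 100
gauss_formula <- n * (n + 1) / 2   # 5050
direct_sum <- sum(1:n)             # 5050

# General arithmetic series: number of terms * (first + last) / 2
s <- seq(1, 500, by = 3)
series_formula <- length(s) * (s[1] + s[length(s)]) / 2

# Both approaches agree
series_formula == sum(s)           # TRUE
```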

In [4]:
# Create first sequence: seq1
seq1 <- seq(1, 500, by = 3)

# Create second sequence: seq2
seq2 <- seq(1200, 900, by = -7)

# Calculate total sum of the sequences
(ssq <- sum(seq1) + sum(seq2))

### grepl & grep
In their most basic form, regular expressions can be used to see whether a pattern exists inside a character string or a vector of character strings. For this purpose, you can use:

1. grepl(), which returns TRUE when a pattern is found in the corresponding character string.
2. grep(), which returns a vector of indices of the character strings that contain the pattern.

Both functions need a pattern and an x argument, where pattern is the regular expression you want to match for, and the x argument is the character vector from which matches should be sought.

In this and the following exercises, you'll be querying and manipulating a character vector of email addresses. The vector emails is defined at the top of each code cell, so you can begin right away.

In [5]:
# The emails vector has already been defined for you
emails <- c("john.doe@ivyleague.edu", "education@world.gov", "dalai.lama@peace.org",
            "invalid.edu", "quant@bigdatacollege.edu", "cookie.monster@sesame.tv")

# Use grepl() to match for "edu"
grepl("edu",emails)

# Use grep() to match for "edu", save result to hits
(hits<-grep("edu",emails))

# Subset emails using hits
emails[hits]

### grepl & grep (2)
You can use the caret, ^, and the dollar sign, $, to match content located at the start and at the end of a string, respectively. This takes us one step closer to a correct pattern for matching only the ".edu" email addresses from our list of emails. But there's more that can be added to make the pattern more robust:

1. @, because a valid email must contain an at-sign.

2. .*, which matches any character (.) zero or more times (*). Both the dot and the asterisk are metacharacters. You can use them to match any character between the at-sign and the ".edu" portion of an email address.

3. \\.edu$, to match the ".edu" part of the email at the end of the string. The \\ part escapes the dot: it tells R that you want to use the . as an actual character.
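
To see why the escape matters, here is a small sketch on a made-up vector `addresses` (an assumption for illustration, separate from the exercise data). Without the escape, the dot matches any character, so a string with no literal ".edu" suffix can still match.

```r
# Hypothetical test strings, not part of the emails vector
addresses <- c("a@school.edu", "b@schooledu", "c@school.education", "school.edu")

# Unescaped dot: any character may precede "edu"
unescaped <- grepl("@.*.edu$", addresses)    # TRUE TRUE FALSE FALSE

# Escaped dot: a literal "." must precede "edu"
escaped <- grepl("@.*\\.edu$", addresses)    # TRUE FALSE FALSE FALSE

# The caret anchors a pattern at the start of the string
starts_with_edu <- grepl("^edu", c("education@world.gov", "john@uni.edu"))  # TRUE FALSE
```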

In [6]:
# The emails vector has already been defined for you
emails <- c("john.doe@ivyleague.edu", "education@world.gov", "dalai.lama@peace.org",
            "invalid.edu", "quant@bigdatacollege.edu", "cookie.monster@sesame.tv")

# Use grepl() to match for .edu addresses more robustly
grepl("@.*\\.edu$",emails)

# Use grep() to match for .edu addresses more robustly, save result to hits
(hits<-grep("@.*\\.edu$",emails))

# Subset emails using hits
emails[hits]

### sub & gsub
While grep() and grepl() simply check whether a regular expression can be matched in a character vector, sub() and gsub() take it one step further: you can specify a replacement argument. If the regular expression pattern is found inside an element of the character vector x, the matching part is replaced with replacement. sub() only replaces the first match in each element, whereas gsub() replaces all matches.
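
The difference between the two functions is easiest to see on strings with more than one match. The inputs below are made up for illustration.

```r
# Single string with several matches
word <- "banana"
first_only <- sub("a", "o", word)     # "bonana"
all_matches <- gsub("a", "o", word)   # "bonono"

# Both functions are vectorised over x; sub() replaces the first match per element
paths <- c("a.b.c", "d.e")
sub_paths <- sub("\\.", "-", paths)    # "a-b.c" "d-e"
gsub_paths <- gsub("\\.", "-", paths)  # "a-b-c" "d-e"
```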

Suppose that the emails vector you've been working with is an excerpt of DataCamp's email database. Why not offer the owners of the .edu email addresses a new email address on the datacamp.edu domain? This could be quite a powerful marketing stunt: Online education is taking over traditional learning institutions! Convert your email and be a part of the new generation!

In [7]:
# The emails vector has already been defined for you
emails <- c("john.doe@ivyleague.edu", "education@world.gov", "global@peace.org",
            "invalid.edu", "quant@bigdatacollege.edu", "cookie.monster@sesame.tv")

# Use sub() to convert the email domains to datacamp.edu
sub("@.*\\.edu$", "@datacamp.edu", emails)

### Right here, right now
In R, dates are represented by Date objects, while times are represented by POSIXct objects. Under the hood, however, these dates and times are simple numerical values. Date objects store the number of days since the 1st of January in 1970. POSIXct objects, on the other hand, store the number of seconds since the 1st of January in 1970.

The 1st of January in 1970 is the common origin for representing times and dates in a wide range of programming languages. There is no particular reason for this; it is a simple convention. Of course, it's also possible to create dates and times before 1970; the corresponding numerical values are simply negative in this case.
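
A short sketch with fixed dates makes the underlying numbers visible. The time zone is set to "UTC" here as an assumption, so that the result does not depend on the local time zone.

```r
# Dates count days from the origin
origin_days <- as.numeric(as.Date("1970-01-01"))      # 0
ten_days_later <- as.numeric(as.Date("1970-01-11"))   # 10
day_before <- as.numeric(as.Date("1969-12-31"))       # -1

# POSIXct objects count seconds from the origin
one_minute_in <- as.POSIXct("1970-01-01 00:01:00", tz = "UTC")
origin_seconds <- as.numeric(one_minute_in)           # 60
```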

In [8]:
# Get the current date: today
(today<-Sys.Date())

# See what today looks like under the hood
unclass(today)

# Get the current time: now
(now<-Sys.time())

# See what now looks like under the hood
(unclass(now))

[1] "2021-01-14 23:07:56 CET"

### Create and format dates
To create a Date object from a simple character string in R, you can use the as.Date() function. The character string has to obey a format that can be defined using a set of symbols (the examples correspond to 13 January, 1982):

1. %Y: 4-digit year (1982)
2. %y: 2-digit year (82)
3. %m: 2-digit month (01)
4. %d: 2-digit day of the month (13)
5. %A: weekday (Wednesday)
6. %a: abbreviated weekday (Wed)
7. %B: month (January)
8. %b: abbreviated month (Jan)

The following R commands will all create the same Date object for the 13th day in January of 1982:

as.Date("1982-01-13")

as.Date("Jan-13-82", format = "%b-%d-%y")

as.Date("13 January, 1982", format = "%d %B, %Y")

Notice that the first line here did not need a format argument, because by default R matches your character string to the formats "%Y-%m-%d" or "%Y/%m/%d".
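
The numeric symbols can be checked with a quick round trip between strings and Date objects. This sketch sticks to %Y, %y, %m and %d, because the weekday and month names produced by %A, %a, %B and %b depend on the locale of the R session.

```r
d <- as.Date("1982-01-13")

# Both default formats give the same Date
same_default <- as.Date("1982/01/13") == d                      # TRUE

# Date to string with format()
year_long <- format(d, "%Y")           # "1982"
year_short <- format(d, "%y")          # "82"
month_num <- format(d, "%m")           # "01"
day_num <- format(d, "%d")             # "13"
custom <- format(d, "%d/%m/%Y")        # "13/01/1982"

# String back to Date, using the same format string
round_trip <- as.Date(custom, format = "%d/%m/%Y") == d         # TRUE
```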

In [9]:
# Definition of character strings representing dates
str1 <- "May 23, '96"
str2 <- "2012-03-15"
str3 <- "30/January/2006"

# Convert the strings to dates: date1, date2, date3
(date1 <- as.Date(str1, format = "%b %d, '%y"))
(date2 <- as.Date(str2, format = "%Y-%m-%d"))
(date3 <- as.Date(str3, format = "%d/%B/%Y"))


# Convert dates to formatted strings
format(date1, "%A")
format(date2, "%d")
format(date3, "%b %Y")

### Create and format times
Similar to working with dates, you can use as.POSIXct() to convert from a character string to a POSIXct object, and format() to convert from a POSIXct object to a character string. Again, you have a wide variety of symbols:

1. %H: hours as a decimal number (00-23)
2. %I: hours as a decimal number (01-12)
3. %M: minutes as a decimal number
4. %S: seconds as a decimal number
5. %T: shorthand notation for the typical format %H:%M:%S
6. %p: AM/PM indicator

For a full list of conversion symbols, consult the strptime documentation in the console:

?strptime

Again, as.POSIXct() uses a default format to match character strings. In this case, it's %Y-%m-%d %H:%M:%S. This exercise ignores differences between time zones.
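
The time symbols can be tried out on a single fixed timestamp. The sketch below sets tz = "UTC" explicitly (an assumption, so the result does not depend on the local time zone) and leaves out %p, whose output depends on the locale.

```r
# Default format "%Y-%m-%d %H:%M:%S", so no format argument is needed
stamp <- as.POSIXct("2012-03-12 14:23:08", tz = "UTC")

hour_24 <- format(stamp, "%H")    # "14"
hour_12 <- format(stamp, "%I")    # "02"
minutes <- format(stamp, "%M")    # "23"
seconds <- format(stamp, "%S")    # "08"
clock <- format(stamp, "%T")      # "14:23:08"
```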

In [10]:
# Definition of character strings representing times
str1 <- "May 23, '96 hours:23 minutes:01 seconds:45"
str2 <- "2012-3-12 14:23:08"

# Convert the strings to POSIXct objects: time1, time2
time1 <- as.POSIXct(str1, format = "%B %d, '%y hours:%H minutes:%M seconds:%S")
time2 <- as.POSIXct(str2)

# Convert times to formatted strings
format(time1, "%M")
format(time2, "%I:%M %p")

### Calculations with Dates
Both Date and POSIXct R objects are represented by simple numerical values under the hood. This makes calculation with time and date objects very straightforward: R performs the calculations using the underlying numerical values, and then converts the result back to human-readable time information again.

You can increment and decrement Date objects, or do actual calculations with them (try it out in the console!):

today <- Sys.Date()

today + 1

today - 1

as.Date("2015-03-12") - as.Date("2015-02-27")
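
With fixed dates, the results of these calculations are reproducible. The sketch below uses the two dates from the last line above; the variable names are illustrative. Subtracting two Date objects yields a difftime object, which as.numeric() turns into a plain number.

```r
start <- as.Date("2015-02-27")
end <- as.Date("2015-03-12")

# Adding a number to a Date shifts it by that many days
one_week_later <- start + 7                  # "2015-03-06"

# Subtracting two Dates gives a difftime object
gap <- end - start                           # Time difference of 13 days
gap_class <- class(gap)                      # "difftime"
gap_days <- as.numeric(gap)                  # 13

# difftime() lets you choose the unit explicitly
gap_hours <- as.numeric(difftime(end, start, units = "hours"))  # 312
```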

To control your eating habits, you decided to write down the dates of the last five days that you ate pizza. These dates are defined below as five Date objects, day1 to day5. The code also combines these five Date objects into a vector pizza.

In [12]:
today <- Sys.Date()
day1 = today +1
day2 = today +5
day3 = today +2
day4 = today +7
day5 = today +4

# day1 to day5 are defined above; note that they are not in chronological order

# Difference between day5 and day1
print(day5 - day1)

# Create vector pizza
pizza <- c(day1, day2, day3, day4, day5)

# Create differences between consecutive elements of pizza: day_diff
# (negative values appear because pizza is not sorted)
(day_diff <- diff(pizza))

# Average of the differences between consecutive elements
mean(day_diff)

Time difference of 3 days


Time differences in days
[1]  4 -3  5 -3

Time difference of 0.75 days

### Time is of the essence
The dates when a season begins and ends can vary depending on who you ask. People in Australia will tell you that spring starts on September 1st. The Irish people in the Northern hemisphere will swear that spring starts on February 1st, with the celebration of St. Brigid's Day. Then there's also the difference between astronomical and meteorological seasons: while astronomers are used to equinoxes and solstices, meteorologists divide the year into 4 fixed seasons that are each three months long. (source: www.timeanddate.com)

A vector astro, which contains character strings representing the dates on which the 4 astronomical seasons start, is defined in the code below. Similarly, a vector meteo holds the meteorological beginnings of each season.

In [15]:
astro = c("20-Mar-2015", "25-Jun-2015", "23-Sep-2015", "22-Dec-2015") 
meteo = c("March 1, 15", "June 1, 15", "September 1, 15", "December 1, 15") 

# Convert astro to vector of Date objects: astro_dates
astro_dates <- as.Date(astro, format = "%d-%b-%Y")

# Convert meteo to vector of Date objects: meteo_dates
meteo_dates <- as.Date(meteo, format = "%B %d, %y")

# Calculate the maximum absolute difference between astro_dates and meteo_dates
max(abs(meteo_dates - astro_dates))

Time difference of 24 days