# Week 5: R Utilities, Regular Expressions, etc.

The material this week focused on various "utilities" in R, basically just useful functions that might come in handy. We could probably fill an entire class with such functions, but I want to spend some time concentrating on a particular utility that was covered very briefly in DataCamp: regular expressions. 

In my opinion, regular expressions are among the most valuable tools you can master in programming because they make working with messy, unformatted data a lot more palatable. And unlike a lot of the tools you'll learn in R, regular expressions are fairly language-agnostic. For instance, I learned them in Python years ago, but the metacharacters are generally the same regardless of language, so the transition has been pretty smooth. 

To start with, I want to briefly remind you of some really useful regex stuff:

## Common Metacharacters in Regex

In case the terminology is confusing, the "metacharacters" in regex refer to symbols that aren't interpreted literally. For instance, the "." character isn't actually interpreted as a dot; rather, in regex it means "any character". So the following produces a match:

In [4]:
my.string <- "Hi there!"
grepl(".",my.string)

Basically, all I did was tell R to look for any character. It'll even work on a string composed entirely of spaces:

In [5]:
my.string <- "  "
grepl(".",my.string)
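
Because "." matches any character, looking for a literal period takes a little extra work: either escape the dot or ask for a fixed (non-regex) match. Here's a minimal sketch; the `file.names` vector is made up purely for illustration:

```r
# Hypothetical example strings, not part of the course data
file.names <- c("report.txt", "report_txt")

# Unescaped dot: "." matches any character, so both strings match
any.char <- grepl("report.txt", file.names)

# Escaped dot: only a literal "." matches
escaped <- grepl("report\\.txt", file.names)

# fixed = TRUE treats the whole pattern as a literal string, not a regex
literal <- grepl("report.txt", file.names, fixed = TRUE)

print(any.char)  # TRUE TRUE
print(escaped)   # TRUE FALSE
print(literal)   # TRUE FALSE
```

The double backslash is needed because R first processes the string (turning "\\." into "\."), and only then hands it to the regex engine.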

Some metacharacters tell R what to look for, others tell R how many times to look for it, and others tell R where to look for it. So, using those 3 categories as a guide, here are some common metacharacters that prove exceedingly useful:

### Character types

* . matches any single character; it's more useful than it sounds
* \\d matches a digit, so any numeric character
* \\s matches a whitespace character (space, tab, newline, etc.)
* \\n matches a newline character
* \\w matches a "word" character, so any letter, digit, or underscore

It's also worth noting that for "\\d", "\\s", and "\\w", a capitalized version of the same letter will match the opposite (i.e. "\\D" matches anything that isn't matched by "\\d"). Here are a couple of examples to illustrate:

In [11]:
my.string <- "Thishasnospaces"
grepl("\\s",my.string)

In [12]:
grepl("\\S",my.string)
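
To make the character types a bit more concrete, here's a small sketch that runs several of them against the same set of strings. The `samples` vector is invented for illustration:

```r
# Made-up strings chosen to exercise each character type
samples <- c("abc", "123", "a_1", "   ", "!?")

has.digit <- grepl("\\d", samples)  # any digit present?
has.word  <- grepl("\\w", samples)  # any letter, digit, or underscore?
has.space <- grepl("\\s", samples)  # any whitespace?
non.word  <- grepl("\\W", samples)  # anything that is NOT a word character?

print(has.digit)  # FALSE TRUE TRUE FALSE FALSE
print(has.word)   # TRUE TRUE TRUE FALSE FALSE
print(has.space)  # FALSE FALSE FALSE TRUE FALSE
print(non.word)   # FALSE FALSE FALSE TRUE TRUE

# \\s also matches tabs, not just the space character
tab.match <- grepl("\\s", "a\tb")
print(tab.match)  # TRUE
```

Note that "a_1" has no match for "\\W": the letter, the underscore, and the digit all count as word characters.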

### Character repeats

For each of these, you'll want to follow your character or expression with one of these symbols; I'll demonstrate once I've defined all of them:

* ? matches 0 or 1 occurrence
* \+ matches 1 or more occurrences
* \* matches 0 or more occurrences
* {m,n} matches at least m, but not more than n occurrences
* {m,} matches at least m occurrences
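* {0,n} matches no more than n occurrences (R's `?regex` help documents only the {n}, {n,} and {n,m} forms, so spell out the 0 rather than writing {,n})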
* {m} matches exactly m occurrences

Here are a few examples to illustrate:

In [20]:
my.string1 <- "My phone number is 8675309"
my.string2 <- "My address is 1274 King St."
my.strings <- c(my.string1, my.string2)

# Match only the phone number by requiring 7 digits in a row:
grepl("\\d{7}",my.strings)

In [21]:
# Match both by looking for 3-8 digits
grepl("\\d{3,8}",my.strings)

In [23]:
# Match only if there's at least 1 number 8
grepl("8+",my.strings)

In [24]:
# Match if there's 0 or 1 number 8; since 0 occurrences count, this matches every string
grepl("8?",my.strings)
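
`grepl` only says whether a match exists, so it can be hard to see what a repeat actually grabbed. Here's a rough sketch that uses base R's `regexpr` and `regmatches` to pull out the matched text instead, reusing the two strings from above (the other inputs are made up for illustration):

```r
my.strings <- c("My phone number is 8675309", "My address is 1274 King St.")

# regexpr() locates the first match in each string; regmatches() extracts it
first.digits <- regmatches(my.strings, regexpr("\\d+", my.strings))
print(first.digits)  # "8675309" "1274"

# Repeats are greedy: {3,4} takes as many digits as it can, up to 4
short.run <- regmatches(my.strings, regexpr("\\d{3,4}", my.strings))
print(short.run)  # "8675" "1274"

# "8?" matches even when there's no 8 at all, because 0 occurrences is allowed
no.eight <- grepl("8?", "no eights here")
print(no.eight)  # TRUE

# \\d{7} on its own also matches inside longer runs of digits; anchoring
# with ^ and $ requires the whole string to be exactly 7 digits
exact.seven <- grepl("^\\d{7}$", c("8675309", "86753091"))
print(exact.seven)  # TRUE FALSE
```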

### Character Placement

Where your match occurs can be important; there are 4 metacharacters that are useful here:

* ^ matches the pattern only at the start of a string
* $ matches the pattern only at the end of a string
* \\b matches at a word boundary, i.e. the start or end of a word (anywhere in the string, not just at its ends)
* \\B matches anywhere that is NOT a word boundary, e.g. between two letters in the middle of a word

Some examples:

In [25]:
my.string1 <- 'Fred is a good friend'
my.string2 <- 'You could say my best friend is Fred'
my.strings <- c(my.string1, my.string2)
# Look for Fred only at the start
grepl("^Fred",my.strings)

In [26]:
# Look for Fred only at the end
grepl("Fred$",my.strings)

In [27]:
# Look for Fred at the start of a word (matches both strings)
grepl("\\bFred",my.strings)

In [28]:
# Look for Fred only when it's preceded by another word character (matches neither string)
grepl("\\BFred",my.strings)
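
Since "\\b" and "\\B" are about word edges rather than string edges, they're easiest to understand with a word that shows up inside other words. A quick sketch, using a made-up `animals` vector:

```r
# Invented strings: "cat" alone, at the start, at the end, and in the middle
animals <- c("cat", "catalog", "bobcat", "concatenate")

# "cat" as a whole word: boundaries required on both sides
whole.word <- grepl("\\bcat\\b", animals)
print(whole.word)  # TRUE FALSE FALSE FALSE

# "cat" at the start of a word
word.start <- grepl("\\bcat", animals)
print(word.start)  # TRUE TRUE FALSE FALSE

# "cat" only when it does NOT start a word
inside <- grepl("\\Bcat", animals)
print(inside)  # FALSE FALSE TRUE TRUE
```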

### A few other useful things

There are a few other things that are really useful and I should point out quickly:

* [something] matches any single character inside the brackets; you can use hyphens for ranges, so "[0-9]" matches any digit and is equivalent to "\\d"
* [^something] matches any single character that is NOT inside the brackets, so a "^" at the start of square brackets effectively means "not"
* (something) allows you to group patterns to be more specific; it also allows you to capture the matched pattern, which is super useful for changing things with `sub` or `gsub`
* \\N refers back to the text captured by the parentheses, where N is a digit from 1 to 9. So \\1 returns the first captured group, \\2 returns the second, etc.
* | acts as an "or" to match alternative patterns
* & is NOT a metacharacter; it just matches a literal ampersand. Regex has no "and" operator, so to require multiple patterns you can combine separate calls with R's own `&`, e.g. `grepl("Fred", x) & grepl("best", x)`

Some quick examples:

In [36]:
my.string1 <- 'sandwiches'
my.string2 <- 'pizza'
my.strings <- c(my.string1, my.string2)

# Look for the letter "z" or "w"
grepl("[zw]", my.strings)

In [37]:
# Look for any character other than the letters "p", "i", "z", or "a"
grepl("[^piza]", my.strings)
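
Square brackets get more powerful once you combine them with repeats and anchors. Here's a small sketch along those lines; the `codes` vector is made up, and `[[:alpha:]]` is the POSIX class for "any letter" that base R supports inside brackets:

```r
# Invented strings for illustration
codes <- c("A1", "b2", "33", "zz")

# Exactly one letter followed by exactly one digit
letter.digit <- grepl("^[[:alpha:]][0-9]$", codes)
print(letter.digit)  # TRUE TRUE FALSE FALSE

# The whole string contains no digits at all
no.digits <- grepl("^[^0-9]+$", codes)
print(no.digits)  # FALSE FALSE FALSE TRUE

# Brackets work with gsub too: strip every vowel
vowels.removed <- gsub("[aeiou]", "", "sandwiches")
print(vowels.removed)  # "sndwchs"
```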

In [38]:
my.string1 <- 'Fred is a good friend'
my.string2 <- 'You could say my best friend is Fred'
my.strings <- c(my.string1, my.string2)

# If you see "good" or "best", insert "very" in front
gsub("(good|best)","very \\1",my.strings)
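
Capture groups really shine when there's more than one of them. As a quick sketch (the `full.names` vector is made up for illustration), here's how two groups can be used to reorder text, plus one way to require two patterns at once by combining separate `grepl` calls:

```r
# Invented names for illustration
full.names <- c("Fred Flintstone", "Barney Rubble")

# Two groups: \\1 is the first word, \\2 is the second; swap them
swapped <- sub("(\\w+) (\\w+)", "\\2, \\1", full.names)
print(swapped)  # "Flintstone, Fred" "Rubble, Barney"

# Requiring two patterns: run grepl twice and combine with R's logical &
my.strings <- c("Fred is a good friend", "You could say my best friend is Fred")
both <- grepl("Fred", my.strings) & grepl("best", my.strings)
print(both)  # FALSE TRUE
```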

## An illustrative activity: parsing KEGG compounds

KEGG is a widely used resource for metabolic studies, and among other things it provides a list of compounds and their metadata. I am interested in extracting all of the compounds from KEGG and creating a data frame where each compound record has the KEGG compound ID, a human-readable name, and the chemical formula. Here's what the top of the KEGG compounds file looks like:

In [50]:
readLines("./keggcompounds.txt", n = 20)

You can also take a look at the file itself to better understand its formatting. As with last week, I'll post a solution below, but I'm looking for a data frame with:

* Each compound ID
* Each corresponding name (first entry only)
* Each corresponding formula, if applicable

As with last week's task, you might find yourself having to search around for answers along the way, and I'd highly recommend writing some pseudocode to guide you. 