ijlyttle edited this page Jan 20, 2013 · 47 revisions

Introduction

Regular expressions are power tools for strings. Actually, it may be more accurate to say that regular expressions are like the bits for power tools for strings, as the tools themselves are functions. Many people are put off, and rightly so, by the presentation of dense, obscure code. However, generous use of comments can make even the most twisted regular-expression code quite comprehensible.

Regular expressions were popularized through the grep utility in UNIX. For this author, they are one of the few things he learned in the mid-1990s that he still uses regularly (pardon the pun). Any time you need to manipulate text, proper use of regular expressions will almost assuredly make the job easier and more robust.

At its most basic, a regular expression is used with a function to identify a pattern within a string, and sometimes to divide the identified pattern into groups. Functions can also use regular expressions to extract the matched text, manipulate it, or substitute a replacement back into the original text.
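As a rough illustration of those three uses (detecting a pattern, dividing a match into groups, and substituting back into the text), here is a minimal sketch in base R. The sample string, the pattern, and the variable names are made up for this example. Base R functions are used so that the sketch runs without extra packages; stringr provides its own functions for the same tasks.

```r
# Hypothetical sample text and pattern, chosen only for illustration.
x <- "Order 66 shipped on 2013-01-20"

# Three parenthesized groups: year, month, day.
pattern <- "([0-9]{4})-([0-9]{2})-([0-9]{2})"

# 1. Identify: does the string contain the pattern?
has_date <- grepl(pattern, x)

# 2. Divide into groups: the first element is the full match,
#    the remaining elements are the parenthesized groups.
parts <- regmatches(x, regexec(pattern, x))[[1]]

# 3. Substitute: reuse the groups in the replacement via backreferences.
swapped <- sub(pattern, "\\3/\\2/\\1", x)

print(has_date)  # TRUE
print(parts)     # "2013-01-20" "2013" "01" "20"
print(swapped)   # "Order 66 shipped on 20/01/2013"
```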

Here are some example regular expressions, "borrowed" from the excellent Wikipedia Page:

  • hat matches "hat"
  • [hc]at matches "hat" and "cat"
  • .at matches any three-character string ending with "at", including "hat", "cat", and "bat"
  • [^b]at matches all strings matched by .at except "bat"
  • [^hc]at matches all strings matched by .at other than "hat" and "cat"
  • [hc]at$ matches "hat" and "cat", but only at the end of the string or line
  • ^[hc]at matches "hat" and "cat", but only at the beginning of the string or line
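The patterns in the list above can be tried directly in R. The sketch below uses base R's grepl() on a small, made-up word list; the helper name matching() is hypothetical. Note that an unanchored pattern matches anywhere inside a string, so "chat" and "hatch" match hat as well.

```r
# A made-up word list for trying out the patterns.
words <- c("hat", "cat", "bat", "chat", "hatch")

# Hypothetical helper: return the words that contain a match for the pattern.
matching <- function(pattern) words[grepl(pattern, words)]

matching("hat")      # "hat" "chat" "hatch"
matching("[hc]at")   # "hat" "cat" "chat" "hatch"
matching(".at")      # "hat" "cat" "bat" "chat" "hatch"
matching("[^b]at")   # "hat" "cat" "chat" "hatch"
matching("[^hc]at")  # "bat"
matching("[hc]at$")  # "hat" "cat" "chat"
matching("^[hc]at")  # "hat" "cat" "hatch"
```

The last two lines show the effect of the anchors: $ rules out "hatch" because the match must sit at the end of the string, while ^ rules out "chat" because the match must sit at the beginning.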

This repository is meant to be only the briefest of introductions to the richness and capability of regular expressions. The goal is to allow the user to get started with regular expressions and with the functions in the R package stringr that use them.

How to use this repository

The goal of this repository is to demonstrate the use of regular expressions, using R. Towards this end, three activities are proposed to the user:

  1. Install -- Download the requisite R packages.
  2. Learn -- Watch a series of YouTube videos. Follow along with R code.
  3. Practice -- Perform a series of exercises with regular expressions in R.

References

Acknowledgements

This repository is based on a number of packages written by Hadley Wickham. The format for the exercise documentation is inspired by the problem-sets for Andrew Ng's course on Machine Learning, offered at Coursera. As well, Paul Buda and Sylvain Marié have provided valuable feedback on the presentation of this Wiki.

Further study

Tony Gray asked about lookaround regular expressions. As he noted, these can be useful because they do not "consume" characters. A couple of notes on lookarounds using stringr:

  • Lookarounds are supported in perl-style regular expressions, which means that we have to wrap our regular expression in the perl() function.
  • We cannot use str_match() or str_match_all() because the base-R function they wrap does not support perl-style expressions.

A quick example (left to the reader to discover lookahead, lookbehind):

> str_replace_all("baseball base", perl("base(?=ball)"), "dodge")
[1] "dodgeball base"
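For readers who want to compare against base R, the same idea can be sketched with gsub() and perl = TRUE, which turns on perl-style expressions without going through stringr. The strings and replacements below are made up for illustration, and the perl() wrapper shown above reflects the stringr version in use when this page was written, so other versions may behave differently.

```r
# Lookahead: match "base" only when it is followed by "ball".
# The "ball" is checked but not consumed, so it survives the replacement.
ahead <- gsub("base(?=ball)", "dodge", "baseball base", perl = TRUE)

# Negative lookahead: match "base" only when it is NOT followed by "ball".
not_ahead <- gsub("base(?!ball)", "home", "baseball base", perl = TRUE)

# Lookbehind: match "ball" only when it is preceded by "base".
behind <- gsub("(?<=base)ball", "line", "baseball ball", perl = TRUE)

print(ahead)      # "dodgeball base"
print(not_ahead)  # "baseball home"
print(behind)     # "baseline ball"
```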