# Regular Expressions

Google revolutionized the internet by making it extremely searchable. This kind of power is at your fingertips.
Regular expressions are a set of syntactical tools for matching patterns in text. As scientists, we can use this powerful syntax to automate many mundane tasks, such as finding and sorting files, finding and fixing formatting errors, and reformatting data. With regular expressions and the tools that use them, you can:

- Navigate the command line more efficiently.
- Quickly find files on the command line based on their content (grep).
- Find and replace a complex expression in many files at once (sed).
- Quickly do math on plain-text columns of data from the command line (awk).




### Exercise: Command Line Regex

On the command line, there are many tools that rely on regular expressions. 

Can you think of a part of the command line tutorial that involved matching a pattern? Recall that the IPython notebook uses the exclamation point to access the command line. In the space below, use the command line to list all files in this directory that end in the extension 'ipynb'.

In [None]:
!ls *.ipynb

## Regex Base Rules

- Alphanumeric characters match themselves.
- A dot (.) matches any character.
- Repeating patterns are matched with *, +, and ?.
- Character sets ([]) and the or operator (|) can match alternatives.
- The position markers ^ and $ match the beginning and end of a line, respectively.
- Parentheses can group things and extract information from matches.
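The rules above can be tried out with Python's built-in re module, which is introduced later in this lesson. The following is a minimal sketch, not an exhaustive reference: the sample strings and the `run_042.dat` filename are made up for illustration, and each pattern is paired with one string it should match and one it should not.

```python
import re

# (pattern, string that should match, string that should not match)
# All sample strings are hypothetical and chosen only to illustrate one rule each.
examples = [
    ("cat",     "cat",   "dog"),      # alphanumeric characters match themselves
    ("c.t",     "cut",   "ct"),       # a dot matches any single character
    ("ab*c",    "abbbc", "adc"),      # * means zero or more of the preceding item
    ("ab+c",    "abc",   "ac"),       # + means one or more
    ("colou?r", "color", "colouur"),  # ? means zero or one
    ("gr[ae]y", "grey",  "groy"),     # a character set matches one listed character
    ("dat|DAT", "DAT",   "Dat"),      # | matches either alternative
    ("^run$",   "run",   "rerun"),    # ^ and $ anchor to the start and end
]

# Record whether each pattern is found in the "good" and "bad" strings.
results = []
for pattern, good, bad in examples:
    found_good = re.search(pattern, good) is not None
    found_bad = re.search(pattern, bad) is not None
    results.append((pattern, found_good, found_bad))
    print(pattern, found_good, found_bad)

# Parentheses capture the part of the match they enclose.
m = re.search(r"run_([0-9]+)\.dat", "run_042.dat")
run_number = m.group(1)
print(run_number)
```

Every row should report that the first string matches and the second does not, and the captured group holds only the digits inside the parentheses.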

### Exercise: Escaping the Escape Character

1. Using the ! magic in the notebook, access the terminal.

In [None]:
!ls

2. Try to create a file that has a backslash in the filename with a command like touch file\name.

In [None]:
!touch file\name

3. Use ls to examine the file you’ve just created. Did it work? Where is the backslash?


In [None]:
!ls *name

4. Use what you’ve just learned to escape the escape character. Can you successfully make a file called file\name?

In [None]:
!touch file\\name
!ls *name
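The same escaping issue appears inside Python strings and in patterns passed to Python's re module. The sketch below assumes a filename containing one literal backslash; the variable names are illustrative only.

```python
import re

# In an ordinary Python string, "\n" is a newline, so this is NOT file\name.
accidental = "file\name"
print(len(accidental))        # f, i, l, e, newline, a, m, e

# Escaping the escape character gives one literal backslash.
filename = "file\\name"
print(len(filename))          # one character longer than the string above

# A regex needs its own escaping. The raw string r"\\" is two characters
# that the regex engine reads as a single literal backslash.
has_backslash = re.search(r"\\", filename) is not None

# re.escape() builds a pattern that matches a string literally,
# so the dot below no longer means "any character".
literal_pattern = re.escape("file.name")
matches_dot = re.fullmatch(literal_pattern, "file.name") is not None
matches_other = re.fullmatch(literal_pattern, "fileXname") is not None
print(has_backslash, matches_dot, matches_other)
```

Raw strings (the `r"..."` prefix) are a common way to keep regex backslashes readable, since Python then passes them to the regex engine unchanged.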

### Exercise: Reverse-Engineer a Regex

The following command finds files with either a .dat or a .DAT extension:

```bash
~ $ find . -regextype posix-extended -regex ".*(\.dat|\.DAT)"
```

1. Can you tell why?
2. What are the backslashes there for?
3. What about the extra specification of -regextype posix-extended?
4. Can you find out what that means from the man page for find?

Discuss these questions with a neighbor and note the answer below:

### Exercise: Redirect sed Output to a File

1. Execute a sed command on a file in your filesystem (try something simple like "s/the/THE/g").
2. Note that the altered file text has appeared on the command line.
3. Using your knowledge of redirection (from Chapter 1), reexecute the command, this time sending the output to a temporary file.
In [None]:
# First, import the regular expression module.
import re

In [None]:
# The string matches the pattern, so a match is returned.
re.match("20[01][0-9].*[0-9][0-9].*[0-9][0-9]", '2015-12-16')

In [None]:
# Assign the match to a variable name for later use
m = re.match("20[01][0-9].*[0-9][0-9].*[0-9][0-9]", '2015-12-16')

In [None]:
# Find the index in the string of the start of the match.
m.start()

In [None]:
# Report all captured groups. This regular expression pattern had no capturing
# parentheses, so no substrings are reported.
m.groups()

In [None]:
# Try to match the date pattern against something that is not a date.
m = re.match("20[01][0-9].*[0-9][0-9].*[0-9][0-9]", 'not-a-date')

In [None]:
# Note how None is returned when the match fails.
m is None
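Because a failed match returns None, code that uses the match object usually checks for None first. The sketch below shows one way to do that with the same date pattern; the `describe` helper and the sample strings are hypothetical. It also shows that match() only looks at the start of the string, while search() scans the whole string.

```python
import re

date_pattern = "20[01][0-9].*[0-9][0-9].*[0-9][0-9]"

def describe(text):
    """Return a short description of whether text starts with a date."""
    m = re.match(date_pattern, text)
    if m is None:
        # Calling m.group() here would raise an AttributeError.
        return "no match"
    return "matched " + m.group(0)

at_start = describe("2015-12-16")
not_date = describe("not-a-date")
print(at_start)
print(not_date)

# The date is present, but not at the beginning of the string.
embedded = "run on 2015-12-16"
match_result = re.match(date_pattern, embedded)    # anchored at the start
search_result = re.search(date_pattern, embedded)  # scans the whole string
print(match_result is None, search_result.start())
```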

### The compile() method

To speed up matching multiple strings against a common pattern, it is usually a good
idea to compile() the pattern. Compiling a pattern typically takes much longer than a
single match, so it pays to do it only once. Once you have a compiled pattern, all of
the same functions are available as methods of the pattern. Since the pattern is
already known, you don’t need to pass it in when you call match() or search() or the
other methods. Let’s compile a version of the date regular expression that has
capturing parentheses around the actual date values:

In [None]:
# Compile the regular expression and store it as the re_date variable.
re_date = re.compile("(20[01][0-9]).*([0-9][0-9]).*([0-9][0-9])")

In [None]:
# Use this variable to match against a string.
re_date.match('2014-28-01')

In [None]:
# Assign the match to a variable m for later use.
m = re_date.match('2014-28-01')

In [None]:
# Since the regular expression uses capturing parentheses, you can obtain the values
# within them using the groups() method. A tuple that has the same length as
# the number of capturing parentheses is returned.
m.groups()
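A compiled pattern pays off when it is reused across many strings. The following minimal sketch applies the same compiled date pattern to a small, made-up list of inputs and keeps the captured groups from each successful match.

```python
import re

# Compile once, then reuse the pattern for every string.
re_date = re.compile("(20[01][0-9]).*([0-9][0-9]).*([0-9][0-9])")

# Hypothetical sample inputs; the last one is not a date.
candidates = ["2014-28-01", "2015-12-16", "not-a-date"]

parsed = []
for text in candidates:
    m = re_date.match(text)
    if m is not None:
        # groups() returns one string per pair of capturing parentheses.
        parsed.append(m.groups())

print(parsed)
```

Strings that do not match are skipped, so `parsed` ends up with one tuple of three strings for each date-like input.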

## Regular Expressions Wrap-Up

At this point, your regular expression skills should include:
- How to speed up command-line use with metacharacters
- How to find files based on patterns in their names (find)
- How to find lines in files based on patterns in their content (grep)
- How to replace text patterns in files (sed)
- How to manipulate columns of data based on patterns (awk)