# Syntax

Now that we've seen what regular expressions are and what they're good for, let's get down to business. Learning to use regular expressions is mostly about learning regular expression syntax: the special ways we can put characters together to make regular expressions. This notebook will be the bulk of our workshop.

## Regular expression syntax<a id='section 1'></a>

All regular expressions are composed of two types of characters: 
* Literals (normal characters)
* Metacharacters (special characters)

### Matching characters exactly

Literals match exactly what they are; they mean what they say. For example, the regular expression `Berkeley` will match the string "Berkeley". (It won't match "berkeley", "berkeeley", or "berkely".) Most characters are literals.

In the example below, the regular expression `regular` will match the string "regular" exactly.

In [7]:
import re
pattern = 'regular'
test_string = 'we are practising our regular expressions Regular,we are fouced lerners'
re.findall(pattern, test_string)

['regular']

In [8]:
test_string.count("Regular")

1

### Matching special patterns

Metacharacters don't match themselves. Instead, they signal that some out-of-the-ordinary thing should be matched, or they affect other portions of the RE by repeating them or changing their meaning. For example, you might want to find all mentions of "dogs" in a text, but you also want to include "dog". That is, you want to match "dogs" but you don't care whether the "s" is there. Or you might want to find the word "the" but only at the beginning of a sentence, not in the middle. For these out-of-the-ordinary patterns, we use metacharacters.

In this workshop, we'll discuss the following metacharacters:

. ^ $ * + ? { } [ ] \ | ( )

In [9]:
pattern = 'dogs?'
test_string = "I like dogs, but my dog doesn't like me."
re.findall(pattern, test_string)

['dogs', 'dog']

In [10]:
pattern = '^the'
test_string = "the best thing about the theatre is the atmosphere"
re.findall(pattern, test_string)

['the']

### Our first metacharacters: [ and ]

The first metacharacters we’ll look at are [ and ]. They’re used for specifying a character class, which is a set of characters that you wish to match.

In [11]:
vowel_pattern = '[aeiou]'
test_string = 'The first metacharacters we’ll look at are '
re.findall(vowel_pattern, test_string)

['e', 'i', 'e', 'a', 'a', 'a', 'e', 'e', 'o', 'o', 'a', 'a', 'e']

#### Challenge 2
Find all the p's and q's in the test string below.

In [13]:
test_string = "Quick, there's a large goat filled with pizzaz. Is there a path to the queen of Zanzabar?"

['p', 'p', 'q']

#### Challenge 3
Find all the vowels in the test sentence below.

In [7]:
test_string = 'the quick brown fox jumped over the lazy dog'

['e', 'u', 'i', 'o', 'o', 'u', 'e', 'o', 'e', 'e', 'a', 'o']

### Ranges

Characters can be listed individually, or a range of characters can be indicated by giving two characters and separating them by a '-'. For example, `[abc]` will match any of the characters a, b, or c; this is the same as `[a-c]`.
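As a quick check of the equivalence described above, here is a minimal sketch. The sample string and variable names are made up for illustration.

```python
import re

# Hypothetical sample text, chosen only for illustration.
sample = 'a1b2c3d4'

# Listing characters individually and giving a range select the same set.
listed = re.findall('[abc]', sample)
ranged = re.findall('[a-c]', sample)
print(listed == ranged)  # True

# Ranges work for digits too, and several ranges can share one class.
digits = re.findall('[0-3]', sample)
letters_or_low_digits = re.findall('[a-c0-2]', sample)
print(digits)                 # ['1', '2', '3']
print(letters_or_low_digits)  # ['a', '1', 'b', '2', 'c']
```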

#### Challenge 4
Find all the capital letters in the following string.

In [15]:
test_string = 'The 44th President of the United States of America was Barack Obama'

['T', 'P', 'U', 'S', 'A', 'B', 'O']

### Complements

You can match the characters not listed within the class by complementing the set. This is indicated by including a `^` as the first character of the class. For example, `[^5]` will match any character except `5`. A `^` anywhere else inside the class has no special meaning and simply matches the `^` character; outside a character class, `^` is the start-of-string anchor we saw earlier in `^the`.

In [40]:
everything_but_t_and_q = '[^tq]'
test_string = 'the quick brown fox jumped over the lazy dog'
re.findall(everything_but_t_and_q, test_string)[:5]

['h', 'e', ' ', 'u', 'i']

#### Challenge 5
Find all the consonants in the test sentence below.

In [17]:
test_string = 'the quick brown fox jumped over the lazy dog'

['t',
 'h',
 'q',
 'c',
 'k',
 'b',
 'r',
 'w',
 'n',
 'f',
 'x',
 'j',
 'm',
 'p',
 'd',
 'v',
 'r',
 't',
 'h',
 'l',
 'z',
 'y',
 'd',
 'g']

#### Challenge 6
Find all the `^` characters in the following test sentence.

In [19]:
test_string = """You can match the characters not listed within the class by complementing the set. 
This is indicated by including a ^ as the first character of the class; 
^ outside a character class will simply match the ^ character. 
For example, [^5] will match any character except 5."""

### Matching metacharacters literally

Challenge 6 is a bit of a trick. The problem is that we want to match the `^` character, but it's interpreted as a metacharacter, a character which has a special meaning. If we want to literally match the `^`, we have to "escape" its special meaning. For this, we use the `\`.
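To see escaping in action on a different metacharacter, here is a small sketch using `.` and `?`. The sample sentence is invented, and the `r` prefix on the patterns marks a raw string, which is covered later in this notebook.

```python
import re

sample = 'Is 3.14 close to pi? Yes.'  # made-up sample text

# Unescaped, '.' is a metacharacter that matches any character except a newline.
unescaped = re.findall('.', sample)
print(len(unescaped) == len(sample))  # True: every character matched

# Escaped with a backslash, it matches only a literal full stop.
escaped_dots = re.findall(r'\.', sample)
print(escaped_dots)  # ['.', '.']

# The same trick works for any other metacharacter, such as '?'.
escaped_question = re.findall(r'\?', sample)
print(escaped_question)  # ['?']
```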

#### Challenge 7
Find all the square brackets `[` and `]` in the following test string.

In [25]:
test_string = "The first metacharacters  we'll look at are [ and ]."

['[', ']']

### Character classes

The backslash `\` has another use in regexes, in addition to escaping metacharacters. It's used as the first character in special two-character combinations that have special meanings. These special two-character combinations are really shorthand for sets of characters.

|      Character     |       Meaning      |   Shorthand for  |
|:------------------:|:------------------:|:----------:|
|        `\d`        |      any digit     | `[0-9]` |
|        `\D`        |    any non-digit   |    `[^0-9]`    |
|        `\s`        |   any whitespace   |    `[ \t\n\r\f\v]`    |
|        `\S`        | any non-whitespace |    `[^ \t\n\r\f\v]`    |
|        `\w`        | any word character |    `[a-zA-Z0-9_]`    |
| what do you think? | any non-word character |         `?`   |
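Here is a short sketch of a few of these shorthands at work. The sample sentence is made up. One caveat: for ordinary Python 3 strings, `\d` and `\w` also match non-ASCII digits and letters, so the bracketed equivalents in the table are exact only for ASCII text.

```python
import re

sample = 'Room 42 opens at 9 am'  # made-up sample text

# \d and [0-9] pick out the same characters in this ASCII-only string.
digits_short = re.findall(r'\d', sample)
digits_long = re.findall('[0-9]', sample)
print(digits_short)                 # ['4', '2', '9']
print(digits_short == digits_long)  # True

# \s matches each whitespace character, here the five spaces.
spaces = re.findall(r'\s', sample)
print(len(spaces))  # 5

# \w matches letters, digits and the underscore.
word_chars = re.findall(r'\w', sample)
print(word_chars[:4])  # ['R', 'o', 'o', 'm']
```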

Now here's a quick tip. When writing regular expressions in Python, use raw strings instead of normal strings. Raw strings are preceded by an `r` in Python code. If we don't, the Python interpreter will try to convert backslashed characters before passing them to the regular expression engine. This will end in tears. You can read more about this [here](https://docs.python.org/3/library/re.html#module-re).
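The sketch below illustrates what goes wrong without a raw string. It uses `\b`, which means "word boundary" to the regex engine but "backspace" to the Python interpreter; `\b` isn't covered elsewhere in this notebook and is used here only because it makes the difference easy to see.

```python
import re

# In a normal string, Python converts '\b' to a backspace character
# before the regex engine ever sees it.
normal = '\bcat\b'
raw = r'\bcat\b'

print(len(normal))  # 5: backspace, c, a, t, backspace
print(len(raw))     # 7: backslash, b, c, a, t, backslash, b

sample = 'the cat sat'  # made-up sample text
normal_matches = re.findall(normal, sample)
raw_matches = re.findall(raw, sample)
print(normal_matches)  # []: the pattern looks for literal backspace characters
print(raw_matches)     # ['cat']: the regex engine sees \b as a word boundary
```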

#### Challenge 8
Find all three-digit prices in the following test sentence. Remember that `$` is a metacharacter, so it needs to be escaped.

In [36]:
test_string = 'The iPhone X costs over $999, while the Android competitor comes in at around $5786875.'

In [37]:
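# \d\d\d matches exactly three digits; \b (a word boundary) stops us
# from matching the first three digits of a longer price.
re.findall(r'\$\d\d\d\b', test_string)

['$999']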

### Repeating things

Being able to match varying sets of characters is the first thing regular expressions can do that isn’t already possible with the methods available on strings. However, if that was the only additional capability of regexes, they wouldn’t be much of an advance. Another capability is that you can specify that portions of the RE must be repeated a certain number of times.

| Character |        Meaning        |    Example    |                Matches               |
|:---------:|:---------------------:|:-------------:|:------------------------------------:|
|   `{n}`   |    exactly n times    |     `a{3}`    |                 'aaa'                |
|  `{n,m}`  | between n and m times | `[1-9]{2,4}`  |          '12', '123', '1234'         |
|    `?`    |      0 or 1 times     |   `colou?r`   |           'color', 'colour'          |
|    `*`    |    0 or more times    |    `data!*`   | 'data', 'data!', 'data!!', 'data!!!' |
|    `+`    |    1 or more times    |     `lo+l`    |        'lol', 'lool', 'loool'        |

#### Challenge 9
Find all prices in the following test sentence.

In [41]:
test_string = """The iPhone X costs over $999, while the Android competitor comes in at around $550.
Apple's MacBook Pro costs $1200, while just a few years ago it was $1700.
A new charger for the MacBook costs over $8.
"""

['$999', '$550', '$1200', '$1700', '$8']

### The `re` module in Python

The regular expression syntax that we've seen so far covers most of the common use cases. Let's take a break from the syntax, and focus on Python's `re` module. It has some quirks that we should talk about, after which we'll get back to the syntax.

Up until now we've only used `re.findall`. This function takes two arguments, a `pattern` and a `text` to search through. It returns a list of all the substrings in `text` that follow `pattern`. 

Two other common functions are `re.match` and `re.search`. These take the same two arguments as `re.findall`. `re.search` looks through `text` for the **first** occurrence of `pattern`. `re.match` only looks at the start of `text`. Rather than returning a list, these two functions return a `match` object, which contains information about the substring in `text` that matches `pattern`. For example, it gives you the starting and ending index of the substring. If no such matching substring is found, they return `None`.

In [66]:
price_pattern = r'\$\d+'
test_string = """The iPhone X costs over $999, while the Android competitor comes in at around $550.
Apple's MacBook Pro costs $1200, while just a few years ago it was $1700.
A new charger for the MacBook costs over $80.
"""
m = re.search(price_pattern, test_string)
m

<re.Match object; span=(24, 28), match='$999'>

The `match` object has several methods and attributes; the most important ones are `group()`, `start()`, `end()` and `span()`. `group()` returns the string that matched the regex, `start()` and `end()` return the starting and ending indices of the match, and `span()` returns both indices as a tuple.

In [101]:
print(m.group())
print(m.start())
print(m.end())
print(m.span())

$999
24
28
(24, 28)
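To make the difference between `re.search` and `re.match` concrete, here is a small sketch with invented sample strings. It also shows one way to guard against the `None` that both functions return when nothing matches.

```python
import re

sample = 'costs $999 today'  # made-up sample text
price = r'\$\d+'

found = re.search(price, sample)    # scans the whole string
at_start = re.match(price, sample)  # only tries the very start

print(found.group())  # $999
print(at_start)       # None, because the string starts with 'costs'

# Calling .group() on None raises an error, so check the result first.
results = []
for text in ['$5 off', 'save $5']:
    m = re.match(price, text)
    results.append(m.group() if m else 'no match at start')
print(results)  # ['$5', 'no match at start']
```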


In general, I prefer just using `re.findall`, because I rarely need the information that `match` object instances give.

#### Challenge 10
Write a function called `first_vowel` that takes in a single word, and returns the first vowel. If there is no vowel in the word, it should return the string `"Hey, no vowel!"`.

In [68]:
print(first_vowel('hello'))
print(first_vowel('sky'))

e
Hey, no vowel!


### Replacing things

So far we've just been finding, but I promised you advanced "find and replace"! That's what `re.sub` is for. `re.sub` takes three arguments: a `pattern` to look for, a `replacement` string to replace it with, and a `text` to look for `pattern` in.
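Here is a minimal sketch of `re.sub` on a made-up sentence. It also uses the optional `count` argument, which limits how many replacements are made.

```python
import re

sample = 'Call 555-1234 or 555-9876 before 5pm.'  # made-up sample text

# Replace every run of digits with a placeholder.
masked = re.sub(r'\d+', '#', sample)
print(masked)  # Call #-# or #-# before #pm.

# With count=1, only the first match is replaced.
first_only = re.sub(r'\d+', '#', sample, count=1)
print(first_only)  # Call #-1234 or 555-9876 before 5pm.
```

Note that `re.sub` returns a new string; `sample` itself is unchanged.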

#### Challenge 11
Replace all the prices in the test string below with `"one million dollars"`.

In [102]:
test_string = """The iPhone X costs over $999, while the Android competitor comes in at around $550.
Apple's MacBook Pro costs $1200, while just a few years ago it was $1700.
A new charger for the MacBook costs over $80.
"""

"The iPhone X costs over one million dollars, while the Android competitor comes in at around one million dollars.\nApple's MacBook Pro costs one million dollars, while just a few years ago it was one million dollars.\nA new charger for the MacBook costs over one million dollars.\n"

So far we've used the module-level functions `re.findall` and friends. We can also `compile` a regex into a `pattern` object. The `pattern` object has methods with identical names to the module-level functions. The benefit mostly shows up when you reuse one pattern many times, such as when searching over huge texts. Otherwise it works exactly the same as what we've been doing so far, so there's no need to complicate things. But you'll see it around, so it's good to know about.

In [37]:
vowel_pattern = re.compile(r'[aeiou]')
test_string = 'abracadabra'
vowel_pattern.findall(test_string)

['a', 'a', 'a', 'a', 'a']

You might also want to experiment with `re.split`.
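As a starting point for that experiment, here is a small sketch. `re.split` takes a `pattern` and a `text`, and cuts `text` wherever `pattern` matches. The sample string is made up.

```python
import re

sample = 'apples, pears;bananas   kiwis'  # made-up sample text

# Split on any run of commas, semicolons or whitespace.
parts = re.split(r'[,;\s]+', sample)
print(parts)  # ['apples', 'pears', 'bananas', 'kiwis']

# The string method split can only use one fixed separator at a time.
plain_parts = sample.split(',')
print(plain_parts)  # ['apples', ' pears;bananas   kiwis']
```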

#### Challenge 12
You've received a problematic dataset from a fellow researcher, with some data entry errors/discrepancies. How would you use regular expressions to correct these errors?

1. Replace all instances of "district" or "District" with "County". 
2. Replace all instances of "Not available" or "[Name] looking up" with numeric codes.  

In [43]:
import os
DATA_DIR = '../data'
fname = os.path.join(DATA_DIR, 'usecase1/problem_dataset.csv')

with open(fname) as f:
    text = f.read()

# DO SOME REGEX MAGIC
# cleaned_text = ...


# with open(os.path.join(DATA_DIR, 'usecase1/cleaned_dataset.csv'), 'w') as f:
#     f.write(cleaned_text)


#### Challenge 13
Find all the robot-related words (such as "robot" and "Robots") in the following string.

In [65]:
robot_string = '''Robots are branching out. A new prototype soft robot takes inspiration from plants by growing to explore its environment.

Vines and some fungi extend from their tips to explore their surroundings. 
Elliot Hawkes of the University of California in Santa Barbara 
and his colleagues designed a bot that works 
on similar principles. Its mechanical body 
sits inside a plastic tube reel that extends 
through pressurized inflation, a method that some 
invertebrates like peanut worms (Sipunculus nudus)
also use to extend their appendages. The plastic 
tubing has two compartments, and inflating one 
side or the other changes the extension direction. 
A camera sensor at the tip alerts the bot when it’s 
about to run into something.

In the lab, Hawkes and his colleagues 
programmed the robot to form 3-D structures such 
as a radio antenna, turn off a valve, navigate a maze, 
swim through glue, act as a fire extinguisher, squeeze 
through tight gaps, shimmy through fly paper and slither 
across a bed of nails. The soft bot can extend up to 
72 meters, and unlike plants, it can grow at a speed of 
10 meters per second, the team reports July 19 in Science Robotics. 
The design could serve as a model for building robots 
that can traverse constrained environments

This isn’t the first robot to take 
inspiration from plants. One plantlike 
predecessor was a robot modeled on roots.'''

['Robots', 'robot', 'robot', 'Robot', 'robots', 'robot', 'robot']

#### Challenge 14
We can use parentheses to capture certain parts of a match.

In [72]:
price_pattern = r'\$(\d+)\.(\d{2})'
test_string = "The iPhone X costs over $999.99, while the Android competitor comes in at around $550.50."
m = re.search(price_pattern, test_string)
dollars, cents = m.group(1), m.group(2)
print(dollars)
print(cents)

999
99


Use parentheses to group together the area code of a US phone number. Write a function called `area_code` that takes in a string, and if it is a valid US phone number, returns the area code. If not, it should return the string `"Hey, not a phone number!"`.

#### Challenge 15
Parentheses can also be used to group together characters in a regular expression so that metacharacters can apply to the entire group, not just a single character.

In [75]:
bat_pattern = r'Bat(wo)?man'
test_string = 'Batwoman, Batman and Robin are good friends.'
re.findall(bat_pattern, test_string)

['wo', '']

What went wrong? Well, parentheses have a double life in regular expression syntax. They are used to signal groups like in Challenge 14, but also to let metacharacters apply to those groups. Those two uses interfere with each other: when a pattern contains a capturing group, `re.findall` returns the text captured by the group rather than the whole match. If we want the `?` to apply to the whole `wo` sequence, but we want the whole substring that matches, we have to use a non-capturing group, written `(?:...)`.

In [64]:
bat_pattern = r'Bat(?:wo)?man'
test_string = 'Batwoman, Batman and Robin are good friends.'
re.findall(bat_pattern, test_string)

['Batwoman', 'Batman']

Look back at Challenge 13, where we looked for words to do with robots. We missed 'Robotics'. Using your newfound non-capturing group skills, correct this.

['Robots', 'robot', 'robot', 'Robotics', 'robots', 'robot', 'robot']

### Challenging challenges

#### Jane Eyre

I've downloaded the entire text of Charlotte Brontë's _Jane Eyre_ from [Project Gutenberg](https://www.gutenberg.org/). Imagine you're a literary scholar studying various aspects of Brontë's work. You might begin by extracting various pieces of information from this book, and comparing them with other works. Here are some tasks you might need to do.

- Find all years (e.g. 1847).
- Find all direct quotes (text between quotation marks).
- Find all Mr.'s, Mrs.'s and Misses (including the name that comes after it).
- Find all lines that use the same word at least twice.
- Write a function that takes in a plural noun and returns the singular version.
- Write a function that takes in a past tense verb and returns the base form.
- Find the relative frequencies of I, you, she, he, we and they.
- Find all URLs (before and after the actual text, there's some legal information from Project Gutenberg).
- Find all email addresses (see above).


In [69]:
fname = os.path.join(DATA_DIR, 'usecase3/jane_eyre.txt')
with open(fname) as f:
    text = f.read()

#### Reddit

I've also included a dataset (in csv format) from [Reddit](https://www.reddit.com/). Regular expressions are really useful for working with text data from the web. In the variable `questions`, you'll find all sorts of questions that people ask on the Internet. Find out:

- How many of them are "serious" (these include the word "serious" in some spelling variant)
- What words do people use before "of Reddit"?

In [76]:
import csv
fname = os.path.join(DATA_DIR, 'askreddit_2015.csv')
with open(fname) as f:
    reader = csv.reader(f)
    posts_with_header = list(reader)
    posts = posts_with_header[1:]
    questions = [p[0] for p in posts]