# Regular Expressions

A regular expression is a pattern that the regular expression engine attempts to match in input text. A pattern consists of one or more character literals, operators, or constructs. Regular expressions provide a powerful, flexible, and efficient method for processing text. The extensive pattern-matching notation of regular expressions enables you to:
- quickly parse large amounts of text to find specific character patterns;
- validate text to ensure that it matches a predefined pattern (such as an e-mail address);
- extract, edit, replace, or delete text substrings.

In [1]:
import re

- `search()`: returns a match object for the first location where the regular expression pattern produces a match; returns `None` if no position in the string matches the pattern.

- `match()`: similar to `search()`, but the match has to occur at the beginning of the string and not anywhere in the string.

- `findall()`: returns all non-overlapping matches of a pattern in a string as a list of strings, or as a list of tuples of strings if the pattern contains more than one group; returns an empty list if no matches are found.

- `finditer()`: returns an iterator over all match objects (not strings) for a pattern in a string.

- `sub()`: returns the string obtained by replacing the leftmost non-overlapping occurrences of the pattern in the string with the given replacement; returns the input string unchanged if the pattern is not found.


The full list of functions, including their respective input parameters, can be found here:
- https://docs.python.org/3.4/library/re.html#re.search
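As a quick orientation, the five functions listed above can be compared side by side on one short string. This is only a minimal sketch; the string `text` and the variable names are made up for illustration.

```python
import re

text = "one cat, two cats"

first = re.search(r'cats?', text)        # match object for the first hit
at_start = re.match(r'cats?', text)      # None: the text does not start with "cat"
all_hits = re.findall(r'cats?', text)    # list of matched strings
spans = [m.span() for m in re.finditer(r'cats?', text)]  # positions via match objects
replaced = re.sub(r'cats?', 'dog', text) # new string with every hit replaced

print(first.group(), first.span())   # cat (4, 7)
print(at_start)                      # None
print(all_hits)                      # ['cat', 'cats']
print(spans)                         # [(4, 7), (13, 17)]
print(replaced)                      # one dog, two dog
```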

The function `show_matches()` is only there to simplify the following examples. At its core, it simply uses `findall()` to find all matches of a given expression in a given string. The rest of the function only prints the result, depending on the form of the matches.

In [2]:
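def show_matches(pattern, string, flags=0):
    matches = re.findall(pattern, string, flags)
    if len(matches) == 0:
        print("No match found.", "\n")
    else:
        try:
            # Without groups, findall() returns a list of strings
            print(', '.join(matches), '\n')
        except TypeError:
            # With several groups, findall() returns a list of tuples
            print(matches, '\n')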

Let's use the following controversial statement for a series of examples.

In [3]:
sentence = "Having 1 cat is nice, but having 20 cats is crazy!"

## Basic patterns

- `a`, `X`, `9`, `<`: ordinary characters just match themselves exactly. The meta-characters which do not match themselves because they have special meanings are: `.` `^` `$` `*` `+` `?` `{` `}` `[` `]` `\` `|` `(` `)` (details below)
- `.` (a period): matches any single character except newline `\n`
- `\w`: (lowercase w) matches a "word" character: a letter, digit, or underscore; for ASCII text this is `[a-zA-Z0-9_]`. Note that although "word" is the mnemonic for this, it only matches a single word character, not a whole word. `\W` (uppercase W) matches any non-word character.
- `\b`: boundary between a word and a non-word character
- `\s`: (lowercase s) matches a single whitespace character -- space, newline, return, tab, form feed `[ \n\r\t\f]`. `\S` (uppercase S) matches any non-whitespace character.
- `\t`, `\n`, `\r`: tab, newline, return
- `\d`: decimal digit `[0-9]` (some older regex utilities do not support `\d`, but they all support `\w` and `\s`)
- `^` = start, `$` = end: match the start or end of the string
- `\`: inhibit the "specialness" of a character. So, for example, use `\.` to match a period or `\\` to match a backslash. If you are unsure whether a punctuation character such as `@` has a special meaning, you can put a backslash in front of it, `\@`, to make sure it is treated just as a character. (A backslash in front of a letter or digit usually has its own meaning, as in `\w` or `\1`.)

In [4]:
show_matches(r'cat', sentence)
show_matches(r'cats', sentence)
show_matches(r'cat\s', sentence)
show_matches(r'\bcat\b', sentence)
show_matches(r'^cat', sentence)
show_matches(r'cat$', sentence)

cat, cat 

cats 

cat  

cat 

No match found. 

No match found. 



In [5]:
show_matches(r'Cat', sentence)
show_matches(r'Cat', sentence, flags=re.IGNORECASE)

No match found. 

cat, cat 



By default, every part of a pattern has to match for the whole pattern to match. The pipe operator `|` combines two patterns by means of a logical OR:

In [None]:
show_matches(r'cats|cat', sentence)
show_matches(r'cat|cats', sentence)

Notice that the two expressions do not yield the same result. In the second expression, the first alternative `cat` already matches the first three letters of "cats", so the second alternative is never tried there. If the two words did not share a common prefix, the order would not matter. For example, `cat|dog` and `dog|cat` would indeed yield the same results.
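Since the cell above carries no recorded output, here is a small sketch of what the two alternations return when passed directly to `re.findall()`. The variable names are illustrative only.

```python
import re

sentence = "Having 1 cat is nice, but having 20 cats is crazy!"

# Alternatives are tried from left to right; the first one that matches wins.
longer_first = re.findall(r'cats|cat', sentence)
shorter_first = re.findall(r'cat|cats', sentence)

print(longer_first)    # ['cat', 'cats']
print(shorter_first)   # ['cat', 'cat']
```

A common rule of thumb that follows from this: when alternatives share a prefix, list the longer one first.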

In [6]:
show_matches(r'\d', sentence)        # Same as: show_matches(r'[0-9]', sentence)
show_matches(r'\d\d', sentence)      # Same as: show_matches(r'[0-9][0-9]', sentence)
show_matches(r'\d\d\d', sentence)    # Same as: show_matches(r'[0-9][0-9][0-9]', sentence)

1, 2, 0 

20 

No match found. 



In [7]:
show_matches(r'\w', sentence)        # Same as: show_matches(r'[a-zA-Z0-9_]', sentence)
show_matches(r'\w\w', sentence)      # Same as: show_matches(r'[a-zA-Z0-9_][a-zA-Z0-9_]', sentence)
show_matches(r'\w\w\w', sentence)    # Same as: show_matches(r'[a-zA-Z0-9_][a-zA-Z0-9_][a-zA-Z0-9_]', sentence)

H, a, v, i, n, g, 1, c, a, t, i, s, n, i, c, e, b, u, t, h, a, v, i, n, g, 2, 0, c, a, t, s, i, s, c, r, a, z, y 

Ha, vi, ng, ca, is, ni, ce, bu, ha, vi, ng, 20, ca, ts, is, cr, az 

Hav, ing, cat, nic, but, hav, ing, cat, cra 



Notice that matches never overlap. For example, in the last example, one might expect something like `Hav`, `avi`, `vin`, `ing`, `cat`, `nic`, `ice`, etc. The functions of Python's built-in `re` module do not return overlapping matches, but there are regex packages available for Python that do, e.g., the third-party `regex` package (`import regex`).
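With the built-in `re` module, overlapping results can still be approximated by wrapping the pattern in a lookahead (lookaheads are covered in the section on lookarounds further down). The sketch below is one possible workaround, not the only one: the lookahead itself consumes no characters, so the engine advances one position at a time, and the capturing group inside it collects the text.

```python
import re

word = "Having"

# Plain pattern: each match consumes three characters.
non_overlapping = re.findall(r'\w\w\w', word)

# Zero-width lookahead with a capturing group: nothing is consumed,
# so a new attempt starts at every position.
overlapping = re.findall(r'(?=(\w\w\w))', word)

print(non_overlapping)   # ['Hav', 'ing']
print(overlapping)       # ['Hav', 'avi', 'vin', 'ing']
```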

In [8]:
show_matches(r'Having', sentence)
show_matches(r'^Having', sentence)
show_matches(r'Having$', sentence)

Having 

Having 

No match found. 



In [9]:
print(re.sub(r'cat', 'dog', sentence))

Having 1 dog is nice, but having 20 dogs is crazy!


**Important:** Always be careful with what you're doing! A pattern without word boundaries also matches inside longer words:

In [52]:
print(re.sub(r'cat', 'dog', 'The scatter plot shows all the categories'))

The sdogter plot shows all the dogegories
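One way to avoid such accidental replacements is to anchor the pattern with word boundaries `\b`. The following is a minimal sketch; the helper name `replace_cat` is hypothetical, and the optional `s` is captured so that the plural survives the replacement.

```python
import re

def replace_cat(text):
    # \b ensures that "cat" is a whole word; (s?) keeps an optional plural "s"
    return re.sub(r'\bcat(s?)\b', r'dog\1', text)

pets = replace_cat("Having 1 cat is nice, but having 20 cats is crazy!")
plot = replace_cat("The scatter plot shows all the categories")

print(pets)   # Having 1 dog is nice, but having 20 dogs is crazy!
print(plot)   # The scatter plot shows all the categories
```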


## Repetition patterns

- `+`: 1 or more occurrences of the pattern to its left, e.g., `i+` = one or more i's
- `*`: 0 or more occurrences of the pattern to its left
- `?`: 0 or 1 occurrences of the pattern to its left
- `{n}`: exactly `n` occurrences of the pattern to its left
- `{l,u}`: between `l` and `u` occurrences (both bounds inclusive). Either bound can be omitted: `{l,}` or `{,u}`
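To see how the counted forms differ, here is a small sketch on a made-up string (`text` is not part of the original examples). Note how `{3}` and `{2,3}` happily cut pieces out of a longer run of digits unless word boundaries are added.

```python
import re

text = "Call 555 or 5551234, extension 42, room 7."

one_or_more = re.findall(r'\d+', text)
exactly_three = re.findall(r'\d{3}', text)
two_to_three = re.findall(r'\d{2,3}', text)
whole_numbers = re.findall(r'\b\d{2,3}\b', text)   # only complete 2- or 3-digit numbers

print(one_or_more)     # ['555', '5551234', '42', '7']
print(exactly_three)   # ['555', '555', '123']
print(two_to_three)    # ['555', '555', '123', '42']
print(whole_numbers)   # ['555', '42']
```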

In [53]:
show_matches(r'\d+', sentence)        # Same as: show_matches(r'[0-9]+', sentence)
show_matches(r'\d*', sentence)        # Same as: show_matches(r'[0-9]*', sentence)
show_matches(r'\d?', sentence)        # Same as: show_matches(r'[0-9]?', sentence)

1, 20 

, , , , , , , 1, , , , , , , , , , , , , , , , , , , , , , , , , , 20, , , , , , , , , , , , , , , ,  

, , , , , , , 1, , , , , , , , , , , , , , , , , , , , , , , , , , 2, 0, , , , , , , , , , , , , , , ,  



In [54]:
show_matches(r'\d{,3}', sentence)
show_matches(r'\d{1}', sentence)
show_matches(r'\d{1,1}', sentence)
show_matches(r'\d{2,3}', sentence)
show_matches(r'\d{2,2}', sentence)
show_matches(r'\d{1,}', sentence)

, , , , , , , 1, , , , , , , , , , , , , , , , , , , , , , , , , , 20, , , , , , , , , , , , , , , ,  

1, 2, 0 

1, 2, 0 

20 

20 

1, 20 



In [55]:
show_matches(r'\w+', sentence)        # Same as: show_matches(r'[a-zA-Z0-9_]+', sentence)
show_matches(r'\w*', sentence)        # Same as: show_matches(r'[a-zA-Z0-9_]*', sentence)
show_matches(r'\w?', sentence)        # Same as: show_matches(r'[a-zA-Z0-9_]?', sentence)
show_matches(r'[a-zA-Z]+', sentence)

Having, 1, cat, is, nice, but, having, 20, cats, is, crazy 

Having, , 1, , cat, , is, , nice, , , but, , having, , 20, , cats, , is, , crazy, ,  

H, a, v, i, n, g, , 1, , c, a, t, , i, s, , n, i, c, e, , , b, u, t, , h, a, v, i, n, g, , 2, 0, , c, a, t, s, , i, s, , c, r, a, z, y, ,  

Having, cat, is, nice, but, having, cats, is, crazy 



In [56]:
show_matches(r'cats+', sentence)
show_matches(r'cats*', sentence)
show_matches(r'cats?', sentence)

cats 

cat, cats 

cat, cats 



## Group extraction

Groups are very useful to "organize" a regular expression into different parts. While all parts/groups must match for the pattern to match, the groups are captured individually and can be accessed as such. As a result, the output of `findall()` is no longer a list of strings, one for each match, but a list of tuples. The number of elements in each tuple reflects the number of groups in the regular expression.

The following looks for all number-word pairs (e.g., "20 cats", but also "300km" since the whitespace between number and word is optional).

In [57]:
show_matches(r'(\d+)\s?(\w+)', sentence)

[('1', 'cat'), ('20', 'cats')] 
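Because each match arrives as a tuple, the groups can be unpacked and post-processed directly. The sketch below uses an invented sentence that also contains a pair without whitespace ("300km") and converts the captured numbers to integers.

```python
import re

text = "We saw 20 cats after driving 300km."

pairs = re.findall(r'(\d+)\s?(\w+)', text)

# Unpack each tuple into its two groups and convert the number
quantities = [(int(number), word) for number, word in pairs]

print(pairs)        # [('20', 'cats'), ('300', 'km')]
print(quantities)   # [(20, 'cats'), (300, 'km')]
```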



The following example splits a sentence into different clauses by looking for commas, colons, and semicolons.

In [58]:
show_matches(r'([\w\s]+)[,;:]([\w\s]+)', sentence)

[('Having 1 cat is nice', ' but having 20 cats is crazy')] 



The following example looks for email addresses in a string.

In [64]:
email_sentence = "You can contact me via test@example.org or demo.user.@example.org."

# matches = re.findall(r'[\w.-]+@[\w.-]+\w', email_sentence)    # simpler alternative without groups
matches = re.findall(r'((\w+[\w.])*@(\w+[.])*\w+)', email_sentence)

for m in matches:
    print(m)

('test@example.org', 'test', 'example.')
('demo.user.@example.org', 'user.', 'example.')


We can use groups to make it easy to individually get the account name (string before @) and the server name (string after @).

In [19]:
matches = re.findall(r'([\w.-]+)@([\w.-]+\w)', email_sentence)

for m in matches:
    print(m)
    print("-- account name: {}".format(m[0]))
    print("-- server name:  {}".format(m[1]))

('test', 'example.org')
-- account name: test
-- server name:  example.org
('demo.user.', 'example.org')
-- account name: demo.user.
-- server name:  example.org


Sometimes, we not only want the matching substrings but also the position of each match in the input string. For example, it can be interesting how far apart two matches are in the string. To accomplish this, we can use the method `finditer()`, which returns an iterator of match objects. These in turn provide methods to get the location of a match in terms of the position of its first character and the position just after its last character.

In [20]:
pattern = re.compile(r'([\w.-]+)@([\w.-]+)')

for m in pattern.finditer(email_sentence):
    print(m)
    print(m.group(), m.span())      # Same as print(m.group(0), m.span(0))
    print(m.group(1), m.span(1))
    print(m.group(2), m.span(2))
    print()

<_sre.SRE_Match object; span=(23, 39), match='test@example.org'>
test@example.org (23, 39)
test (23, 27)
example.org (28, 39)

<_sre.SRE_Match object; span=(43, 66), match='demo.user.@example.org.'>
demo.user.@example.org. (43, 66)
demo.user. (43, 53)
example.org. (54, 66)



Note that `m` is not a string but a match object, since we are using `finditer()` instead of `findall()`.

By default, groups are numbered in their order of appearance. Thus, to access the first group, we have to write `group(1)` and `span(1)`. `group(0)` or simply `group()` returns the complete match covering all groups and the "rest"; `span(0)` or `span()` returns its position.

Groups can also be nested -- see the example later below. In case of nesting, the numbering derives from the order in which groups are opened by `(`.
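A short sketch of this numbering rule, using a made-up page range: the outer group opens first and therefore becomes group 1, while the two inner groups become groups 2 and 3.

```python
import re

m = re.search(r'((\d+)-(\d+))', "Pages 10-20 are missing.")

whole = m.group(0)      # the complete match
groups = m.groups()     # groups 1..3, numbered by the position of their "("

print(whole)    # 10-20
print(groups)   # ('10-20', '10', '20')
```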

### Named groups

Instead of using the implicit ordering of groups to access the (partial) matches, we can also give each group its own name with `(?P<name>...)`, e.g., `(?P<account>...)`. This allows us to access the groups by name, which avoids having to keep track of the ordering -- but makes the regular expression more scary looking.

The example below again extracts all email addresses and differentiates between the account and server part of each address; see above. However, here we use named groups to make printing the matches more intuitive.

In [21]:
pattern = re.compile(r'(?P<account>[\w.-]+)@(?P<server>[\w.-]+\w)')

for m in pattern.finditer(email_sentence):
    print(m)
    print(m.group(), m.span())
    print(m.group('account'), m.span('account'))
    print(m.group('server'), m.span('server'))
    print()

<_sre.SRE_Match object; span=(23, 39), match='test@example.org'>
test@example.org (23, 39)
test (23, 27)
example.org (28, 39)

<_sre.SRE_Match object; span=(43, 65), match='demo.user.@example.org'>
demo.user.@example.org (43, 65)
demo.user. (43, 53)
example.org (54, 65)
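Named groups also make it easy to turn each match into a dictionary via the match object's `groupdict()` method. The following sketch reuses the pattern from above on an invented sentence.

```python
import re

pattern = re.compile(r'(?P<account>[\w.-]+)@(?P<server>[\w.-]+\w)')

text = "Write to test@example.org or admin@example.com."

# groupdict() maps each group name to the text it captured
records = [m.groupdict() for m in pattern.finditer(text)]

for record in records:
    print(record['account'], '->', record['server'])
```

Each element of `records` is a plain `dict`, so the results can be passed on without remembering the group order.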



### Grouping and Capturing

Parentheses serve two different purposes: grouping expressions and capturing the text that matches an expression. That is, by default, each matching group is captured and returned. However, sometimes groups are only used for matching but not for capturing. Consider the following example:

In [22]:
gc_sentence = "Aircraft and airplane are synonyms, but a jet is a special kind of airplane."

From this sentence we want to extract all mentions of "aircraft", "airplane", and "jet". Making use of the syntactical similarities between "aircraft" and "airplane", we can accomplish this by using:

In [23]:
pattern = re.compile(r'(air(craft|plane)|jet)', flags=re.IGNORECASE)

for m in pattern.finditer(gc_sentence):
    print(m.group(1), m.span(1))
    print(m.group(2), m.span(2))

Aircraft (0, 8)
craft (3, 8)
airplane (13, 21)
plane (16, 21)
jet (42, 45)
None (-1, -1)
airplane (67, 75)
plane (70, 75)


Since we have two groups, we not only capture "aircraft" and "airplane" but also "craft" and "plane". This is not really what we want. To get rid of this, we can switch off the capturing of the 2nd group with `?:` -- now the 2nd group is only used for matching:

In [24]:
pattern = re.compile(r'(air(?:craft|plane)|jet)', flags=re.IGNORECASE)

for m in pattern.finditer(gc_sentence):
    print(m.group(1), m.span(1))
    # print(m.group(2), m.span(2)) # Now throws an error since the 2nd group has been "disabled"

Aircraft (0, 8)
airplane (13, 21)
jet (42, 45)
airplane (67, 75)


Side note: In this easy example, `(aircraft|airplane|jet)` would also work just fine.
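The difference between capturing and non-capturing groups is especially visible with `findall()`, which returns the captured groups rather than the full match as soon as a pattern contains a group. A minimal sketch on the sentence from above:

```python
import re

gc_sentence = "Aircraft and airplane are synonyms, but a jet is a special kind of airplane."

# One capturing group: findall() returns only what that group captured
# (an empty string where the group did not take part in the match).
capturing = re.findall(r'air(craft|plane)|jet', gc_sentence, flags=re.IGNORECASE)

# Non-capturing group: findall() returns the full matches.
non_capturing = re.findall(r'air(?:craft|plane)|jet', gc_sentence, flags=re.IGNORECASE)

print(capturing)       # ['craft', 'plane', '', 'plane']
print(non_capturing)   # ['Aircraft', 'airplane', 'jet', 'airplane']
```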

## Backreferences

Backreferences are a concept related to groups. They allow us to refer to a previously captured group within the regular expression itself. The common use case is to find repeated patterns in a string.

In [25]:
br_sentence = "School is the coolest time of the day hahahahahaha."

In the following example, we are looking for words that contain repeated substrings. In the sentence above, that can be "School", where the repeated substring is "o", or "hahahahahaha", where the repeated substring is "ha".

In [26]:
show_matches(r'\w*(\w)\1\w*', br_sentence)
show_matches(r'\w*(\w+)\1\w*', br_sentence)

o, o 

o, o, ha 



This didn't quite work: if the pattern contains groups, `findall()` returns only the text captured by the groups, not the full match. To get the full match, one can use `finditer()`, or keep `findall()` and change the expression a little bit.

In [27]:
pattern = re.compile(r'\w*(\w+)\1\w*')

for m in pattern.finditer(br_sentence):
    print(m.group(), m.span())

School (0, 6)
coolest (14, 21)
hahahahahaha (38, 50)


The alternative is to use `findall()` and pack the whole expression into its own group. The only required change is that we need to replace `\1` with `\2` since we now have two groups -- actually nested groups, and the group that checks for doubled characters is the inner group.

In [41]:
test_sentence = "This is a test."    # separate variable so that br_sentence stays unchanged

show_matches(r'(\w*(\w)\2\w*)', test_sentence)
show_matches(r'(\w*(\w{2,})\2\w*)', test_sentence)
show_matches(r'(\b(\w)\w*\2\b)', test_sentence, flags=re.IGNORECASE)

for m in re.findall(r'(\w*(\w+)\2\w*)', test_sentence):
    print(m[0]) # We know that we are interested in the first group ==> first element in tuple

No match found. 

No match found. 

[('test', 't')] 
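For comparison, the same outer-group idea can be applied to the "School" sentence from the beginning of this section. This is only a sketch: each element returned by `findall()` is a tuple of (outer group, inner group), and we keep the outer one.

```python
import re

text = "School is the coolest time of the day hahahahahaha."

# Group 1 wraps the whole word; group 2 is the repeated substring, referenced by \2
matches = re.findall(r'(\w*(\w+)\2\w*)', text)

words = [outer for outer, inner in matches]
repeated = [inner for outer, inner in matches]

print(words)      # ['School', 'coolest', 'hahahahahaha']
print(repeated)   # ['o', 'o', 'ha']
```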



Just as an additional example, let's use named groups to make the access to the matches more intuitive.

In [33]:
pattern = re.compile(r'(?P<doubledword>\w*(\w+)\2\w*)')

for m in pattern.finditer(br_sentence):
    print(m.group('doubledword'), m.span('doubledword'))
    print()

School (0, 6)

coolest (14, 21)

hahahahahaha (38, 50)



### Replacing substrings using backreferences

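The replacement string can also contain backreferences. The syntax `\g<1>` refers to the first group. The shorter form `\1` works in the replacement as well, but `\g<1>` avoids ambiguities: `\g<1>0` is group 1 followed by a literal "0", whereas `\10` would be read as a reference to group 10.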

In [34]:
print(re.sub(r'(is)', r'was and \g<1>', br_sentence))

School was and is the coolest time of the day hahahahahaha.
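The ambiguity that `\g<1>` resolves shows up as soon as a digit directly follows the backreference. In the sketch below (the input string is invented), `\g<1>0` appends a literal "0" to group 1, while `\10` is interpreted as a reference to a non-existent group 10 and raises an error.

```python
import re

# Append a "0" to every digit: group 1 followed by a literal 0
with_g = re.sub(r'(\d)', r'\g<1>0', 'a1 b2')
print(with_g)   # a10 b20

# The short form is read as "group 10", which does not exist in this pattern
try:
    re.sub(r'(\d)', r'\10', 'a1 b2')
    error_message = None
except re.error as exc:
    error_message = str(exc)

print(error_message)
```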


The following substring replacement using `sub()` finds doubled words and replaces them with just the first occurrence.

In [35]:
print(re.sub(r'\b(\w+)\s+\1\b', r'\g<1>', 'The the length of of a text is often defined by the the number of words.', flags=re.IGNORECASE))

The length of a text is often defined by the number of words.


We can also replace doubled words with the second occurrence, but this is likely to cause more problems when the doubled words appear at the beginning of a sentence, with the first occurrence being capitalized and the second one being lowercase. Note that we first have to change `\1` to `(\1)` to put the second occurrence into its own group; otherwise we cannot refer to it with `\g<2>`.

In [36]:
print(re.sub(r'\b(\w+)\s+(\1)\b', r'\g<2>', 'The the length of of a text is often defined by the the number of words.', flags=re.IGNORECASE))

the length of a text is often defined by the number of words.


## Lookarounds: Lookaheads & Lookbehinds

Lookarounds are assertions that look ahead or behind to ensure that a subpattern does or does not occur. Lookarounds match characters just like any other pattern, but then give up the match, returning only the result: match or no match (hence an assertion).

With the 2 directions (ahead, behind) and the 2 types of assertions (match, no match), there are four types of lookarounds:

- `(?=)`: positive lookahead -- `A(?=B)` finds expression A but only when followed by expression B

- `(?!)`: negative lookahead -- `A(?!B)` finds expression A but only when *not* followed by expression B

- `(?<=)`: positive lookbehind -- `(?<=B)A` finds expression A but only when preceded by expression B

- `(?<!)`: negative lookbehind -- `(?<!B)A` finds expression A but only when *not* preceded by expression B


Lookaround expressions are very useful when there are several conditions. 

**Important:** In Python's `re` module, the pattern of a lookbehind needs to be of fixed length, that is, it cannot contain any repetition specifier that allows for a varying number of characters. For example, `(?<!B+)A`, for some expression `B`, would throw an error.
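The fixed-length restriction can be observed directly: compiling a lookbehind that contains `+` fails in the built-in `re` module. A minimal sketch with an invented input string:

```python
import re

# Fixed-width lookbehind: exactly the two characters "S$"
fixed = re.findall(r'(?<=S\$)\d+', "S$15 and US$20 and $30")
print(fixed)   # ['15', '20']

# Variable-width lookbehind: rejected when the pattern is compiled
try:
    re.compile(r'(?<=S+\$)\d+')
    error_message = None
except re.error as exc:
    error_message = str(exc)

print(error_message)
```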

### Lookaheads

The following example shows a simple positive lookahead: We are searching for the first names of all people with the last name "Simpson". That means we are looking for all words that are followed by " Simpson" (but we do not care about "Simpson" itself).

In [79]:
lap_sentence = "The team consists of Homer Simpson, Barney Gumble, Monty Burns, Marge Simpson, Ned Flanders, and Lenny Lennard."

pattern = re.compile(r'\w+(?= Simpson)')

for m in pattern.finditer(lap_sentence):
    print(m.group())

Homer
Marge
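The negative counterpart works the same way. As a small sketch on the cat sentence from the beginning of the tutorial, `cat(?=s)` finds "cat" only inside "cats", whereas `cat(?!s)` finds only the singular. In both cases the "s" itself is not part of the match, which the spans make visible.

```python
import re

sentence = "Having 1 cat is nice, but having 20 cats is crazy!"

followed_by_s = [m.span() for m in re.finditer(r'cat(?=s)', sentence)]
not_followed_by_s = [m.span() for m in re.finditer(r'cat(?!s)', sentence)]

print(followed_by_s)       # [(36, 39)]
print(not_followed_by_s)   # [(9, 12)]
```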


### Lookbehinds

The following example shows a simple positive lookbehind: We are searching for all amounts of money in SGD.

In [606]:
lbp_sentence = "For 5 years, the funding is $1,000,000.00 which converts to S$1,307,600.00."

# pattern = re.compile(r'[0-9,.]*\d')     # try the pattern without the lookbehind
pattern = re.compile(r'(?<=S\$)[0-9,.]*\d')

for m in pattern.finditer(lbp_sentence):
    print(m.group())

1,307,600.00


Note that we have to escape `$` as `\$` since `$` is a metacharacter in regular expressions.
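A negative lookbehind inverts the condition. The sketch below, on the same sentence, finds dollar amounts that are *not* preceded by "S". Here the `$` is part of the pattern itself, so it appears in the result.

```python
import re

lbp_sentence = "For 5 years, the funding is $1,000,000.00 which converts to S$1,307,600.00."

# (?<!S) asserts that the character before "$" is not "S"
non_sgd = re.findall(r'(?<!S)\$[0-9,.]*\d', lbp_sentence)

print(non_sgd)   # ['$1,000,000.00']
```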

## More Examples

### Find all dates in a text

In [3]:
date_sentence = "the office will be closed from 20/03/2017 to 25.03.2017."

pattern = re.compile(r'\d+[/.]\d+[/.]\d+')

for m in pattern.finditer(date_sentence):
    print(m.group())

20/03/2017
25.03.2017


Don't forget that regular expressions only look for patterns. The expression above cannot truly find only real dates. For example, the expression would also match `123.456.789`, which might be a phone number. It is usually possible to make a regular expression more robust against wrong matches, but its complexity can quickly shoot up. In this example, we can use `r'\d{1,2}[/.]\d{1,2}[/.]\d{2,4}'` to ensure that `123.456.789` wouldn't match.
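The effect of the stricter expression can be checked with a short sketch; the test string is made up. The last line illustrates the remaining limitation: the pattern checks only the shape of the text, not whether it is a valid calendar date.

```python
import re

text = "Dates: 20/03/2017 and 25.03.2017; not a date: 123.456.789"

loose = re.findall(r'\d+[/.]\d+[/.]\d+', text)
strict = re.findall(r'\d{1,2}[/.]\d{1,2}[/.]\d{2,4}', text)

print(loose)    # ['20/03/2017', '25.03.2017', '123.456.789']
print(strict)   # ['20/03/2017', '25.03.2017']

# Still only a pattern check: day 99 in month 99 is accepted
impossible = re.findall(r'\d{1,2}[/.]\d{1,2}[/.]\d{2,4}', "99.99.9999")
print(impossible)   # ['99.99.9999']
```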

### Find all IATA flight numbers in a text

According to Wikipedia, IATA flight numbers are comprised of a 2-letter identifier of the airline and an assigned number -- fun fact: low numbers represent more prestigious flights. Officially, there's no whitespace between the airline code and the number, but many sources place one.

In [607]:
iata_sentence = "Was thinking about taking the SQ 326 but now I'm taking LH9765."

pattern = re.compile(r'\b[a-zA-Z]{2}\s?[\d]{1,4}\b')

for m in pattern.finditer(iata_sentence):
    print(m.group())

SQ 326
LH9765


### Change text from American English to British English (a bit)

In [608]:
ae_sentence = "Let's organize a meeting to list all commercialized products for the citizens."

In [609]:
# be_sentence = re.sub(r'ize', 'ise', ae_sentence, flags=re.IGNORECASE) # Check why this fails

be_sentence = re.sub(r'\b([\w]+)iz(ing|e[sd]?)?\b', r'\g<1>is\g<2>', ae_sentence, flags=re.IGNORECASE)

print(be_sentence)

Let's organise a meeting to list all commercialised products for the citizens.
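To see why the commented-out naive replacement is not good enough, the two variants can be compared side by side. This is a sketch using the sentence and the pattern from the cell above; only the variable names `naive` and `careful` are new.

```python
import re

ae_sentence = "Let's organize a meeting to list all commercialized products for the citizens."

# Naive: replaces "ize" anywhere, including inside "citizens"
naive = re.sub(r'ize', 'ise', ae_sentence, flags=re.IGNORECASE)

# Careful: "iz" must be followed by a known suffix and then a word boundary
careful = re.sub(r'\b([\w]+)iz(ing|e[sd]?)?\b', r'\g<1>is\g<2>', ae_sentence, flags=re.IGNORECASE)

print(naive)     # ... for the citisens.
print(careful)   # ... for the citizens.
```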


### Password strength checker

Regular expressions do not necessarily need to return a match. In this example, we use a regular expression to check the strength of passwords, i.e., a password needs to fulfil some requirements to be valid (see list below). A check for several conditions is often a good indicator that lookarounds are the best option.

- `(?=.*[a-z])`: at least 1 lowercase alphabetical character

- `(?=.*[A-Z])`: at least 1 uppercase alphabetical character

- `(?=.*[0-9])`: at least 1 numeric character

- `(?=.*[!@#\$%\^&\*])`: at least one special character (note the escaped reserved RegEx characters)

- `(?=.{8,})`: eight characters or longer

In [610]:
good_password = 'testTEST123#'
bad_password = 'testtest123#'   # no uppercase character

In [611]:
pattern = re.compile(r'^(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])(?=.*[!@#\$%\^&\*])(?=.{8,})')

print(pattern.match(good_password))
print(pattern.match(bad_password))

<_sre.SRE_Match object; span=(0, 0), match=''>
None


Note that even valid passwords return only an empty string. This is because the whole pattern is comprised of lookaheads, which are assertions (i.e., patterns that give up the match; see above). If you want to get the password as a matched string, you can change the last lookahead `(?=.{8,})` to a normal pattern `.{8,}`.
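Putting this together, the check can be wrapped in a small helper that returns a boolean. This is a minimal sketch: the names `STRONG` and `is_strong` are hypothetical, the last lookahead is replaced by the consuming pattern `.{8,}` as described above, and the special characters are written unescaped since `$`, `^` (not in first position), and `*` have no special meaning inside a character class.

```python
import re

# Four lookaheads assert the character classes; .{8,} consumes the password itself
STRONG = re.compile(r'^(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])(?=.*[!@#$%^&*]).{8,}$')

def is_strong(password):
    """Return True if the password fulfils all five requirements."""
    return STRONG.match(password) is not None

print(is_strong('testTEST123#'))   # True
print(is_strong('testtest123#'))   # False (no uppercase character)
print(is_strong('aA1#'))           # False (too short)
print(STRONG.match('testTEST123#').group())   # testTEST123#
```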

## Summary

Regular expressions are a powerful and flexible tool to analyze text. This tutorial covered only a small subset of the capabilities of regular expressions.

- In general, the same task can be solved with different regular expressions, from simple to (overly) complex

- Regular expressions are "easy to learn but hard to master"

- Practically all modern programming languages support regular expressions

- Regex engines (e.g., Python's `re`) are not all created equal. Not all support all (advanced) features.