# String nightmares: A brief tour into the world of regular expressions

So, we have arrived at the low point of the Python sessions: string matching and regular expressions. This is dangerous and frustrating territory, since regular expressions are almost a language of their own. Therefore, I will not even try to be comprehensive here. Instead, we will play around with the vocabulary a little so that you get a feel for when regular expressions are helpful and what you can do with them. 

## 1 Principal functions for matching

We have already worked with some string methods, and we now turn to the topic of matching strings with regular expressions. A regular expression defines a string pattern that we would like to match against a given source string. Before we can start properly, we need a string to work with. This time we take something more famous than a lowly xkcd poem. 

In [199]:
alice = '''If I had a world of my own, everything would be nonsense. 
        Nothing would be what it is, because everything would be what it isn't. 
        And contrary wise, what is, it wouldn't be. 
        And what it wouldn't be, it would. You see?'''

### 1.1 Match at the beginning of a string with `match`

The easiest method for string matching is the `match` function from the `re` module which we will import now. 

In [200]:
import re

It checks whether a string __starts__ with a specific pattern. In this case our pattern will just be `If` and the string `alice` will be our source. 

In [217]:
match_result_1 = re.match('If', alice)

In this case we have passed the pattern `If` directly as an argument. If we work on more complex tasks, we can also first compile a pattern. The following code does the same thing as the one above. 

In [203]:
my_pattern = re.compile('If') 
match_result_2 = my_pattern.match(alice)
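
Compiling mostly pays off when the same pattern is applied to several strings. Here is a minimal sketch of that idea; the `lines` list and the name `starts_with_if` are made up for illustration and are not part of the notebook.

```python
import re

# Compile once, reuse many times
starts_with_if = re.compile('If')

# Hypothetical inputs, only for illustration
lines = ['If I had a world of my own', 'Nothing would be what it is']

# match() returns a match object or None; bool() turns that into True/False
results = [bool(starts_with_if.match(line)) for line in lines]
print(results)
```

Only the first line starts with `If`, so only the first entry is `True`.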


So far, it might not be obvious what the difference between plain strings and regular expressions is. Bear with me for the moment; we will come to that in the next section. First, we will take a look at some other useful functions. 

### 1.2 Match everywhere with `search` 

The `search` function is a generalization of the `match` function: it searches for a pattern throughout the whole string and returns the first result. 

In [204]:
search_result_1 = re.search('world', alice)

Unlike `match`, it does not require the pattern to sit at the start of the string, and it stops at the first match even if there are several.
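
To make the difference between `match` and `search` concrete, here is a small sketch on a shortened version of the source string (the variable `sentence` is my own shorthand, not part of the notebook). When nothing is found, both functions give back `None`.

```python
import re

# First clause of the Alice quote, shortened for illustration
sentence = 'If I had a world of my own'

at_start = re.match('world', sentence)    # 'world' is not at the very beginning
anywhere = re.search('world', sentence)   # scans through the whole string

print(at_start is None)       # match found nothing
print(anywhere is not None)   # search found something
```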

### A brief detour: match objects

You might have noticed that we have not looked at the results returned by the `match` and `search` functions. This is because they return strange objects. Let's take a look. 

In [220]:
print(match_result_1)

<_sre.SRE_Match object; span=(0, 2), match='If'>


In [221]:
print(search_result_1)

<_sre.SRE_Match object; span=(11, 16), match='world'>


As you can see, the functions return match objects, which give you the offset range of the match in the source string as well as the matched text. You can access both separately. 

In [223]:
match_result_1.span()

(0, 2)

In [225]:
alice[0:2]

'If'

In [227]:
match_result_1.group()

'If'

We can do the exact same thing for `search`.

In [207]:
search_result_1.span()

(11, 16)

In [229]:
alice[11:16]

'world'

In [222]:
search_result_1.group()

'world'

Now, what is returned if no match is found? Let's find out. 

In [210]:
search_result_2 = re.search('supercalifragilistic', alice)
print(search_result_2)

None


The function returned `None`, which makes sense because there is no match. We still have a problem, though, when we call the `group` method on this result. 

In [14]:
search_result_2.group()

AttributeError: 'NoneType' object has no attribute 'group'

How can you prevent Python from throwing an exception whenever no match is found and you call the `group` method? You can use the fact that `None` evaluates to `False` in a boolean context and wrap the call in a conditional. 

In [230]:
if search_result_2: 
    print(search_result_2.group())

In [231]:
if search_result_1: 
    print(search_result_1.group())

world
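
If you need this check often, you could wrap it in a small helper. The following is only a sketch; `first_match` is a hypothetical name of my own and not a function of the `re` module.

```python
import re

def first_match(pattern, source, default=None):
    """Return the text of the first match, or `default` if there is none."""
    result = re.search(pattern, source)
    return result.group() if result else default

# Shortened source string, for illustration only
sentence = 'If I had a world of my own'

found = first_match('world', sentence)
missing = first_match('supercalifragilistic', sentence, default='')

print(repr(found))
print(repr(missing))
```

Returning a default value instead of raising an exception is a design choice; if a missing match is really an error in your program, the plain conditional above is the more honest option.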


OK, now that we have settled this topic we go back to our matching functions. 

### 1.3 List of matches with `findall`

The `findall` function is a further generalization of the `search` function and it returns a list of all non-overlapping matches. 

In [17]:
findall_result_1 = re.findall('it', alice)
print(findall_result_1)

['it', 'it', 'it', 'it', 'it']


If there is no match, an empty list is returned. 

In [18]:
findall_result_2 = re.findall('frägellägel', alice)
print(findall_result_2)

[]
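
What "non-overlapping" means is easier to see on a toy string than on Alice. A small sketch, with strings I made up for the purpose:

```python
import re

# 'aaaa' contains 'aa' at positions 0, 1 and 2, but the one at position 1
# overlaps the match at position 0, so findall skips it
pairs = re.findall('aa', 'aaaa')
print(pairs)

# With only three characters there is room for a single match
single = re.findall('aa', 'aaa')
print(single)
```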


### 1.4 Split at pattern with `split`

The `split` function allows you to split a source string at the matches and returns a list of the resulting pieces. 

In [19]:
split_result_1 = re.split('it', alice)
split_result_1

['If I had a world of my own, everything would be nonsense. \n        Nothing would be what ',
 ' is, because everything would be what ',
 " isn't. \n        And contrary wise, what is, ",
 " wouldn't be. \n        And what ",
 " wouldn't be, ",
 ' would. You see?']

If no match is found, a list with one element, the original source string, will be returned. 

In [None]:
split_result_2 = re.split('smoogle', alice)
split_result_2
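
Both cases can be checked quickly on a single sentence. This is a sketch using one line of the quote (the variable names are mine); note that the join trick in the second half only works because the pattern is plain text.

```python
import re

sentence = "And what it wouldn't be, it would."

# No match: the result is a list holding the untouched source string
no_split = re.split('smoogle', sentence)
print(no_split == [sentence])

# For a plain-text pattern, joining the pieces with the pattern restores the source
pieces = re.split('it', sentence)
print(pieces)
print('it'.join(pieces) == sentence)
```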

### 1.5 Replace matches in a string with `sub`

Sometimes, you might want to replace all substrings that match a certain pattern with another string. You can do this with the `sub` function. It returns a new string with the requested replacements; the source string itself is left untouched. 

In [None]:
re.sub('i', 'ü', alice)
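
On a shorter string the effect is easier to read. The sketch below also uses the optional `count` argument of `re.sub` to limit the number of replacements; the variable names are my own.

```python
import re

sentence = 'Nothing would be what it is'

# Replace every 'i'
replaced = re.sub('i', 'ü', sentence)
print(replaced)

# Replace only the first 'i'
first_only = re.sub('i', 'ü', sentence, count=1)
print(first_only)

# The source string is unchanged, since sub returns a new string
print(sentence)
```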

These are all already neat functions, but they become truly powerful when we combine them with regular expressions, which we will turn to next. 

## 2 Creating patterns

### 2.1 The basics

So far, none of this seems too intimidating. But we are only just starting out. Note that we can pass not only plain strings but also more complex patterns to the functions above. Let's say we want to find all substrings consisting of a `w` followed by any other character. We can do this by adding a `'.'`.

In [20]:
re.findall('w.', alice)

['wo', 'wn', 'wo', 'wo', 'wh', 'wo', 'wh', 'wi', 'wh', 'wo', 'wh', 'wo', 'wo']

Cool, right? We have a bunch of those basic operators: 
+ `.`: any character except \n,
+ `*`: preceding character can appear a number of times (including zero times),
+ `?`: preceding character is optional.

In the following, we do some examples.

In [21]:
# an arbitrary character + 'u'
source = "Humpty Dumpty"
re.findall('.u', source)

['Hu', 'Du']

In [22]:
# a 'u' optionally preceded by an 'H'
re.findall('H?u', source)

['Hu', 'u']

In [23]:
# sequences of one or more 'e'
source = 'Tweedle Dee and Tweedle Dum'
re.findall('ee*', source)

['ee', 'e', 'ee', 'ee', 'e']

You can already see how powerful and horribly ugly these things can become. Let's take it up a notch. 

### 2.2 Special characters

Apart from the usual characters, you can use a number of special characters:
+ `\d`: a single digit 
+ `\D`: a single non-digit
+ `\w`: an alphanumeric character (digits, letters or underscore)
+ `\W`: a non-alphanumeric character
+ `\s`: a whitespace character
+ `\S`: a non-whitespace character
+ `\b`: a word boundary
+ `\B`: a non-word boundary

I know, whoever came up with these should burn in a special kind of hell. Still, let's try to work with them. I am afraid we cannot use Alice here, since she's not complicated enough. You might be happy about that, though!

In [24]:
# split the address into its parts
address = 'Blümlisalpstrasse 10, 8006 Zürich'
# raw strings (r'...') pass the backslashes on to the regex engine unchanged
# postal code
print(re.findall(r'\d\d\d\d', address))
# street and house number
print(re.findall(r'\w\w*\s\d\d*', address))
# city
print(re.findall(r'\s\D\D*', address))

['8006']
['Blümlisalpstrasse 10']
[' Zürich']
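
The word boundary `\b` and its opposite `\B` are not used in the address example, so here is a small sketch with a line from the quote. The patterns are written as raw strings (`r'...'`) because in a normal Python string `'\b'` means a backspace character. The variable names are my own.

```python
import re

sentence = "And contrary wise, what is, it wouldn't be."

# Plain 'is' also matches inside 'wise'
loose = re.findall('is', sentence)

# With word boundaries on both sides, only the standalone word 'is' matches
strict = re.search(r'\bis\b', sentence)

# \B requires that there is NO boundary, so this finds the 'is' inside 'wise'
inside = re.search(r'\Bis', sentence)

print(loose)
print(strict.span(), inside.span())
```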


Sometimes we want to match on something but keep only a part of the match. For example, we might want my house number without the comma. Using parentheses, we can organize regular expressions into capturing groups. 

In [None]:
my_match = re.search(r'(\d\d),', address)

In [None]:
print(my_match.group(0))

Calling `group` with 0 gives you the whole matched sequence. Calling it with 1 gives you the first capturing group, which is the part we are interested in. 

In [None]:
print(my_match.group(1))
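
Put together in one self-contained sketch, the two calls behave as follows. The second pattern, with two groups, is an addition of mine to show the `groups` method of match objects, which returns all captured groups at once.

```python
import re

address = 'Blümlisalpstrasse 10, 8006 Zürich'

my_match = re.search(r'(\d\d),', address)
whole = my_match.group(0)    # the full match, including the comma
number = my_match.group(1)   # only the part inside the parentheses
print(repr(whole), repr(number))

# Two capturing groups in one pattern: house number and postal code
both = re.search(r'(\d+), (\d+)', address)
print(both.groups())
```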

### 2.3 Pattern specifiers

Admittedly, these patterns are not super-elegant yet. We need more specifiers: 
+ `(expr)`: `expr`, captured as a group
+ `expr1|expr2`: `expr1` or `expr2`
+ `^`: start of source string
+ `$`: end of source string
+ `expr?`: zero or one of `expr` 
+ `expr*`: zero or more of `expr`, as many as possible
+ `expr*?`: zero or more of `expr`, as few as possible
+ `expr+`: one or more of `expr`, as many as possible
+ `expr+?`: one or more of `expr`, as few as possible
+ `expr{m}`: `m` consecutive `expr`
+ `expr{m,n}`: `m` to `n` consecutive `expr`, as many as possible
+ `expr{m,n}?`: `m` to `n` consecutive `expr`, as few as possible
+ `[abc]`: `a`, `b`, or `c`
+ `[^abc]`: not `a`, `b`, or `c`
+ `expr(?=next)`: `expr` if followed by `next`
+ `expr(?!next)`: `expr` if not followed by `next` 
+ `(?<=prev)expr`: `expr` if preceded by `prev`
+ `(?<!prev)expr`: `expr` if not preceded by `prev`
+ `[a-z]`: a lowercase ASCII letter
+ `[A-Za-z]`: an ASCII letter, lower or upper case (note that `[A-z]` also matches the characters `[`, `\`, `]`, `^`, `_` and `` ` ``, which sit between `Z` and `a`)
+ `[b-w]`: a lowercase letter between `b` and `w`
+ `[0-9]`: a digit
+ `[1-5]`: a digit between `1` and `5`
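
Ranges follow character-code order, which makes `[A-z]` wider than it looks: the underscore, for instance, sits between `Z` and `a`. A small sketch with a made-up test string:

```python
import re

sample = 'snake_case'

# [A-z] runs from 'A' to 'z' in character-code order and therefore
# also covers '_', so the whole string comes back as one piece
wide = re.findall('[A-z]+', sample)

# [A-Za-z] contains letters only, so the underscore splits the string
letters_only = re.findall('[A-Za-z]+', sample)

print(wide)
print(letters_only)
```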

This is why I think this chapter of our course is aptly named. You will not learn this quickly. But let's go through some examples. 

In [12]:
# choice between two expressions
source = "Humpty Dumpty"
re.findall('Humpty|Dumpty', source)

['Humpty', 'Dumpty']

In [13]:
# alternative expression
re.findall('[HD]umpty', source)

['Humpty', 'Dumpty']

In [14]:
# look for 'Dumpty' at the beginning of the string
re.findall('^Dumpty', source)

[]

In [15]:
# look for 'Dumpty' at the end of the string
re.findall('Dumpty$', source)

['Dumpty']

In [16]:
source = 'Tweedle Dee'
# find sequences of one or more 'e' character, as many as possible
re.findall('e+', source)

['ee', 'e', 'ee']

In [17]:
# find sequences of one or more 'e' character, as few as possible
re.findall('e+?', source)

['e', 'e', 'e', 'e', 'e']
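
The difference between "as many as possible" and "as few as possible" becomes more visible once something follows the quantifier. A sketch on the same source string, with patterns of my own choosing:

```python
import re

source = 'Tweedle Dee'

# Greedy: '.*' grabs as much as it can while still leaving an 'e' at the end
greedy = re.findall('T.*e', source)

# Lazy: '.*?' stops as soon as the first 'e' comes up
lazy = re.findall('T.*?e', source)

print(greedy)
print(lazy)
```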

In [18]:
# find sequences of one or two 'e' characters 
re.findall('e{1,2}', source)

['ee', 'e', 'ee']

In [19]:
# find sequences of two 'e' characters
re.findall('e{2}', source)

['ee', 'ee']

In [20]:
# alternative: a group containing 'ee', repeated exactly once
re.findall('(ee){1}', source)

['ee', 'ee']
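
The lookahead and lookbehind specifiers deserve a quick try as well. The following sketch uses patterns I picked for illustration; `start()` gives the position where a match begins, which is the only way to tell the two `Tweedle` matches apart.

```python
import re

source = 'Tweedle Dee and Tweedle Dum'

# Positive lookbehind: the word after 'Tweedle ', without 'Tweedle ' itself
names = re.findall(r'(?<=Tweedle )\w+', source)

# Positive lookahead: a 'Tweedle' that is followed by ' Dum'
before_dum = re.search(r'Tweedle(?= Dum)', source)

# Negative lookahead: a 'Tweedle' that is NOT followed by ' Dum'
not_before_dum = re.search(r'Tweedle(?! Dum)', source)

print(names)
print(before_dum.start(), not_before_dum.start())
```

In all three cases the text inside the lookaround is checked but not included in the match.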

Let's finally do the address thing again: 

In [21]:
# split the address into its parts
address = 'Blüemlisalpstrasse 10, 8006 Zürich'
# street (alphanumeric character string at the beginning of the line)
print(re.findall(r'^\w+', address))
# city (alphanumeric character string at the end of the line)
print(re.findall(r'\w+$', address))
# postal code (string of four digits preceded by comma and space)
print(re.findall(r'(?<=,\s)\d{4}', address))
# house number (string of digits, possibly followed by some non-digit characters e.g. 12A, followed by a comma)
print(re.findall(r'\d+\D*(?=,)', address))

['Blüemlisalpstrasse']
['Zürich']
['8006']
['10']
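
With capturing groups, the four pieces can also be pulled out in one go. This is just one possible sketch, and it assumes the address always has the shape "street number, code city"; the pattern and the variable names are mine.

```python
import re

address = 'Blüemlisalpstrasse 10, 8006 Zürich'

# One group per part: street, house number (digits plus optional letters),
# four-digit postal code, city
pattern = re.compile(r'^(\w+)\s(\d+\w*),\s(\d{4})\s(\w+)$')

result = pattern.match(address)
# Fall back to an empty tuple if the address does not have the assumed shape
parts = result.groups() if result else ()
print(parts)
```

A multi-word street or city name would not match this pattern, so treat it as a starting point rather than an address parser.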


### 2.4 Tips

Regular expressions can be very powerful for retrieving patterns. However, the syntax is particularly cumbersome, and you will most probably forget most of it very soon. Therefore, here are a few tips that might help you retrieve the most useful information with minimal effort.

If you are trying to retrieve information using regular expressions, most probably you do not know the content of that information but you know:
1. What comes before/after
2. Some parts of the string you want to extract
3. Some of both

Let's see one simple example.

In [23]:
# Example string
text = 'The street is Blüemlisalpstrasse, the number is 10, the postal code is 8006 Zürich.'

In general, you want to structure your query as follows:
```
string = "before(beginning[content]{length}end)after"

result = re.findall(string, text)
```

where
- **before**: is the part that you know comes before the string of interest
- **beginning**: is the beginning of the string of interest
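- **content**: is the format of the content of the string (letters, numbers, both, symbols...). The simplest way for me to remember content is to stick to the few most relevant pieces of syntax, all used inside square brackets:
    - `A-Za-z` is ASCII letters (beware of `A-z`, which also matches `[`, `\`, `]`, `^`, `_` and `` ` ``)
    - `0-9` is digits
    - `\w` is any letter, digit or underscore (in Python 3 this includes non-ASCII letters such as `ü`)
    - `^` at the start of the brackets stands for "except for"
- **length**: the number of content characters. If absent, it is exactly 1; `+` means one or more; `{m,n}` means between `m` and `n`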
- **end**: is the end of the string of interest
- **after**: is the part that you know comes after the string of interest

This is the simplest, most general and most intuitive structure (for me) that still captures most of the functionality you will ever need to retrieve patterns in strings.
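
As one concrete instance of this template, here is a sketch that fills in every slot for the example string. The choice of slots is mine: before is `' is '`, beginning is `B`, content is `[\w]`, length is `+`, end is `e` and after is `,`.

```python
import re

text = 'The street is Blüemlisalpstrasse, the number is 10, the postal code is 8006 Zürich.'

# before(beginning[content]{length}end)after
pattern = r' is (B[\w]+e),'

result = re.findall(pattern, text)
print(result)
```

Only the part inside the parentheses is returned; the surrounding `' is '` and `','` just pin down where to look.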

### 2.5 Examples

In [140]:
# Every word character (first 10)
# Note that the square brackets around \w are superfluous here
re.findall(r'([\w])', text)[:10]

['T', 'h', 'e', 's', 't', 'r', 'e', 'e', 't', 'i']

In [191]:
# Every pair of word characters (first 10)
re.findall(r'([\w]{2})', text)[:10]

['Th', 'st', 're', 'et', 'is', 'Bl', 'üe', 'ml', 'is', 'al']

In [190]:
# Every sequence of 10 word characters 
re.findall(r'([\w]{10})', text)

['Blüemlisal']

In [188]:
# Every sequence of up to 10 word characters, including empty matches (first 10)
re.findall(r'([\w]{,10})', text)[:10]

['The', '', 'street', '', 'is', '', 'Blüemlisal', 'pstrasse', '', '']

In [189]:
# Every sequence of 3 to 10 word characters
re.findall(r'([\w]{3,10})', text)

['The',
 'street',
 'Blüemlisal',
 'pstrasse',
 'the',
 'number',
 'the',
 'postal',
 'code',
 '8006',
 'Zürich']

In [192]:
# Every word
re.findall(r'([\w]+)', text)

['The',
 'street',
 'is',
 'Blüemlisalpstrasse',
 'the',
 'number',
 'is',
 '10',
 'the',
 'postal',
 'code',
 'is',
 '8006',
 'Zürich']

In [160]:
# Everything that is not a space
re.findall('([^ ]+)', text)

['The',
 'street',
 'is',
 'Blüemlisalpstrasse,',
 'the',
 'number',
 'is',
 '10,',
 'the',
 'postal',
 'code',
 'is',
 '8006',
 'Zürich.']

In [161]:
# Everything that is not a comma or a dot
re.findall('([^.,]+)', text)

['The street is Blüemlisalpstrasse',
 ' the number is 10',
 ' the postal code is 8006 Zürich']

In [162]:
# Everything that is not a space, a comma or a dot
re.findall('([^ .,]+)', text)

['The',
 'street',
 'is',
 'Blüemlisalpstrasse',
 'the',
 'number',
 'is',
 '10',
 'the',
 'postal',
 'code',
 'is',
 '8006',
 'Zürich']

In [163]:
# Everything that is not a space, a comma or a dot and is at least 4 characters long
re.findall('([^ .,]{4,})', text)

['street', 'Blüemlisalpstrasse', 'number', 'postal', 'code', '8006', 'Zürich']

In [164]:
# First character after " is " 
re.findall(r' is ([\w])', text)

['B', '1', '8']

In [165]:
# First 2 characters after " is "
re.findall(r' is ([\w]{2})', text)

['Bl', '10', '80']

In [166]:
# First 3 characters after " is "
re.findall(r' is ([\w]{3})', text)

['Blü', '800']

In [167]:
# First up to 3 characters after " is "
re.findall(r' is ([\w]{,3})', text)

['Blü', '10', '800']

In [168]:
# First 3 to 4 characters after " is "
re.findall(r' is ([\w]{3,4})', text)

['Blüe', '8006']

In [169]:
# Word after " is ", until a non-word character (comma/dot/space)
re.findall(r' is ([\w]+)', text)

['Blüemlisalpstrasse', '10', '8006']

In [170]:
# Multiple words after " is ", until special character (comma/dot)
re.findall(r' is ([\w ]+)', text)

['Blüemlisalpstrasse', '10', '8006 Zürich']

In [171]:
# ASCII letters after " is "... the match stops at the non-ASCII 'ü' :(
re.findall(' is ([A-Za-z]+)', text)

['Bl']

In [172]:
# Digits after " is "
re.findall(' is ([0-9]+)', text)

['10', '8006']

In [173]:
# Digits and ASCII letters after " is "
re.findall(' is ([A-Za-z0-9]+)', text)

['Bl', '10', '8006']

In [174]:
# Digits, ASCII letters and spaces after " is "
re.findall(' is ([A-Za-z 0-9]+)', text)

['Bl', '10', '8006 Z']

In [175]:
# Everything after " is " that is not a space
re.findall(' is ([^ ]+)', text)

['Blüemlisalpstrasse,', '10,', '8006']

In [176]:
# Word before a comma
re.findall(r'([\w]+),', text)

['Blüemlisalpstrasse', '10']

In [177]:
# ASCII letters before a comma (the match cannot extend back past the 'ü')
re.findall('([A-Za-z]+),', text)

['emlisalpstrasse']

In [178]:
# Numbers before a comma
re.findall('([0-9]+),', text)

['10']

In [179]:
# Word before a comma or a dot
re.findall(r'([\w]+)[.,]', text)

['Blüemlisalpstrasse', '10', 'Zürich']

In [180]:
# Any run of word characters that starts with "B"
re.findall(r'(B[\w]+)', text)

['Blüemlisalpstrasse']

In [181]:
# Any run of word characters up to its last "e" (not necessarily a whole word, see 'stree')
re.findall(r'([\w]+e)', text)

['The', 'stree', 'Blüemlisalpstrasse', 'the', 'numbe', 'the', 'code']

In [182]:
# Any word that ends in "e" and is followed by a space
re.findall(r'([\w]+e) ', text)

['The', 'the', 'the', 'code']

In [183]:
# Any lowercase text that ends in "e" and is followed by a space
re.findall('([a-z]+e) ', text)

['he', 'the', 'the', 'code']

In [193]:
# Any lowercase text that starts with "t", ends in "e" and is followed by a space
re.findall('(t[a-z]+e) ', text)

['the', 'the']

Now you should be able to use regular expressions for pattern extraction in most scenarios. If you want to test your regular expression skills, you can do it here: http://play.inginf.units.it/#/.