# Regular Expressions

***

Regular expressions describe patterns - in this case, patterns of text.

In Python, regular expressions are described by strings.

They're strings that describe patterns in strings.

We use raw Python strings (written with an r prefix, as in r'a+') for regular expressions so that Python does not process backslash escapes such as \b before the re module sees them (https://stackoverflow.com/a/2081708).
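
As a minimal sketch of why the raw prefix matters, the snippet below compares a normal string literal with a raw one. The variable names and the sample text are illustrative only. In the normal literal, Python turns `\b` into a backspace character, so the regex engine never sees the word-boundary escape.

```python
import re

# In a normal string literal, '\b' is a single backspace character.
normal = '\b'
# In a raw string literal, r'\b' is two characters: a backslash and a 'b'.
raw = r'\b'
print(len(normal), len(raw))

# Only the raw version reaches the regex engine as a word boundary.
text = 'cat concat'
raw_hits = re.findall(r'\bcat', text)
normal_hits = re.findall('\bcat', text)
print(raw_hits, normal_hits)
```

The raw pattern finds the word 'cat' once, while the normal one looks for a backspace followed by 'cat' and finds nothing.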

In a sense, regular expressions represent sets (collections) of strings.

For example, the regular expression r'a+' represents the set of strings {'a', 'aa', 'aaa', ...}.
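
The set view can be checked directly, as a rough sketch using Python's standard `re` module: `re.fullmatch` asks whether a whole string belongs to the set a pattern describes. The list of candidate strings is made up for illustration.

```python
import re

# r'a+' describes the set {'a', 'aa', 'aaa', ...}.
pattern = r'a+'

# fullmatch succeeds only when the entire string is in that set.
candidates = ['', 'a', 'aa', 'aaa', 'ab', 'b']
members = [s for s in candidates if re.fullmatch(pattern, s)]
print(members)
```

Only the strings made up entirely of one or more a's are kept.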

***

In [1]:
# In Python, the re module in the standard library provides regular expression (regex) functionality.
import re

<br>

## Examples

***

In [2]:
# Here's a regular expression that matches three or more digits in a row.
r = r'[0-9]{3,}'

In [3]:
# Test the regex against some strings.
if re.fullmatch(r, '123'):
    print("Match")
else:
    print("No match")

Match


In [4]:
# Only two digits here, so the regex does not match.
if re.fullmatch(r, '20'):
    print("Match")
else:
    print("No match")

No match


In [5]:
# fullmatch fails here because the full stop is not a digit.
if re.fullmatch(r, '123.53'):
    print("Match")
else:
    print("No match")

No match


In [6]:
# Where fullmatch tests the whole string, match will test the start of the string.
if re.match(r, '123.53'):
    print("Match")
else:
    print("No match")

Match


In [7]:
# search returns the first match found anywhere in the string.
re.search(r, 'The number is 123.45, so it is. Another is 432.1')

<re.Match object; span=(14, 17), match='123'>

In [8]:
# findall returns every non-overlapping match as a list of strings.
re.findall(r, 'The number is 123.45, so it is. Another is 432.1')

['123', '432']

In [9]:
# sub replaces every match with the given replacement string.
re.sub(r, 'NUMBER HERE', 'The number is 123.45, so it is. Another is 432.1')

'The number is NUMBER HERE.45, so it is. Another is NUMBER HERE.1'

In [10]:
# Here's a regular expression that matches one or more digits, optionally followed by a full stop and more digits.
r = r'[0-9]{1,}\.?[0-9]*'

In [11]:
re.findall(r, 'The number 1 can be written as 1.0, 1.00, and believe it or not 0.9999...')

['1', '1.0', '1.00', '0.9999']
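
To summarise the functions used so far, here is a small side-by-side sketch of `fullmatch`, `match` and `search` applied to the same string. The sample text and variable names are assumptions made for illustration.

```python
import re

pattern = r'[0-9]{3,}'
text = 'abc 1234'

# fullmatch: the entire string must match the pattern.
whole = re.fullmatch(pattern, text)
# match: the pattern must match at the very start of the string.
start = re.match(pattern, text)
# search: the pattern may match anywhere in the string.
anywhere = re.search(pattern, text)

print(whole, start, anywhere)
```

Only `search` succeeds here, because the digits are neither the whole string nor at its start.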

## Special Characters

***

Most characters in a regex pattern just mean themselves.

For instance 'a' is a regular expression meaning "match a single a".

However, some characters have a special meaning.

For instance, '.' is a regular expression meaning "match any character".

To match a full stop you have to escape it: '\.'

Depending on the context, a character might be special or not.

For instance, '12' means a '1' followed by a '2' but 'a{12}' means 'a' twelve times in a row.
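
The effect of escaping can be sketched as follows, using a made-up sample string. The example also uses `re.escape` from the standard library, which adds the backslashes needed to match a string literally.

```python
import re

text = '1.5 1x5 125'

# Unescaped, the dot matches any character (except a newline by default).
loose = re.findall(r'1.5', text)
# Escaped, the dot only matches a literal full stop.
strict = re.findall(r'1\.5', text)
# re.escape builds the escaped pattern for us.
escaped = re.escape('1.5')
auto = re.findall(escaped, text)

print(loose, strict, escaped, auto)
```

Here `loose` picks up all three items, while `strict` and `auto` only find '1.5'.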

***

In [12]:
# Non-special characters match themselves.
re.match(r'a', 'a')

<re.Match object; span=(0, 1), match='a'>

In [13]:
# Non-special characters match themselves - re.match returns None when there is no match, so nothing is displayed.
re.match(r'a', 'b')

In [14]:
# Special characters give regular expressions their power.
re.match(r'.', 'a')

<re.Match object; span=(0, 1), match='a'>

In [15]:
# Special characters give regular expressions their power.
re.match(r'.', 'b')

<re.Match object; span=(0, 1), match='b'>

In [16]:
# The dot matches any character, including a literal full stop.
re.match(r'.', '.')

<re.Match object; span=(0, 1), match='.'>

In [17]:
# A single dot matches exactly one character, so 'bb' does not fully match.
re.fullmatch(r'.', 'bb')

In [18]:
# Regular expressions can be concatenated: an 'a' followed by any character.
re.match(r'a.', 'abbbb')

<re.Match object; span=(0, 2), match='ab'>

In [19]:
# The concatenated pattern needs an 'a' first, so there is no match here.
re.match(r'a.', 'bbbbb')

<br>

## Sets

***

We can match any of a specified set or range of characters using square brackets.

We can match anything but a specified set too.

To match a literal square bracket, you have to escape it, as in '\[' or '\]'.
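
A short sketch of escaped brackets, with a made-up sample string: the outer `\[` and `\]` are literal brackets, while the inner `[0-9]` is an ordinary set. The second part shows that a full stop inside a set is just a full stop.

```python
import re

text = 'items[0] and items[12]'

# \[ and \] match literal square brackets; [0-9]+ in between is a set with a quantifier.
indexes = re.findall(r'\[[0-9]+\]', text)
print(indexes)

# Inside a set, '.' loses its special meaning and matches only a full stop.
dots_or_commas = re.findall(r'[.,]', '1.5, 2.5')
print(dots_or_commas)
```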

***


In [20]:
# Matching a or b.
re.findall(r'[ab]', "a")

['a']

In [21]:
# Matching a or b.
re.findall(r'[ab]', "b")

['b']

In [22]:
# Matching a or b.
re.findall(r'[ab]', "ab")

['a', 'b']

In [23]:
# Matching a or b.
re.findall(r'[ab]', "ac")

['a']

In [24]:
# Matching anything but a or b.
re.findall(r'[^ab]', "ac")

['c']

In [25]:
# Using ranges: a-z means the lowercase alphabet.
re.findall(r'[a-z]', "ac")

['a', 'c']

In [26]:
# Using ranges: A-Z means the uppercase alphabet.
re.findall(r'[A-Z]', "Aac")

['A']

In [27]:
# Using ranges: 0-9 means any digit.
re.findall(r'[0-9]', "1a2c3")

['1', '2', '3']

In [28]:
# Using ranges: ^0-9 means anything but a digit.
re.findall(r'[^0-9]', "1a2c3")

['a', 'c']

<br>

## Quantifiers

***

Quantifiers are used to specify repetition.

For example, rather than typing 1,000 a's, we can just say "a one thousand times".

This would be written as 'a{1000}'.
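
The one-thousand example can be tried out directly, as a small sketch. The strings are built with Python's string repetition rather than typed by hand.

```python
import re

pattern = r'a{1000}'

# Exactly one thousand a's matches the whole pattern.
exact = re.fullmatch(pattern, 'a' * 1000)
# One a short of a thousand does not.
too_short = re.fullmatch(pattern, 'a' * 999)

print(exact is not None, too_short is None)
```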

In [29]:
# Asterisk means zero or more of the preceding item - it's greedy by default.
re.match('.*', 'bbbbbbb')

<re.Match object; span=(0, 7), match='bbbbbbb'>

In [30]:
# Asterisk means zero or more, so it also matches the empty string.
re.match('.*', '')

<re.Match object; span=(0, 0), match=''>

In [31]:
# Quantifiers apply to the immediately preceding item only.
re.match('a.*', 'abbbbb')

<re.Match object; span=(0, 6), match='abbbbb'>

In [32]:
# Plus means one or more.
re.match('.+', '')

In [33]:
# Plus means one or more.
re.match('.+', 'a')

<re.Match object; span=(0, 1), match='a'>

In [34]:
# Curly braces allow specific numbers.
re.fullmatch('a{3}', 'a')

In [35]:
# Curly braces allow specific numbers.
re.fullmatch('a{3}', 'aa')

In [36]:
# Curly braces allow specific numbers.
re.fullmatch('a{3}', 'aaa')

<re.Match object; span=(0, 3), match='aaa'>

In [37]:
# Curly braces allow specific numbers.
re.fullmatch('a{3}', 'aaaa')

In [38]:
# Curly braces also allow a range: here, three to four times.
re.fullmatch('a{3,4}', 'aaaa')

<re.Match object; span=(0, 4), match='aaaa'>

In [39]:
# Leaving out the upper bound after the comma means "or more".
re.fullmatch('a{3,}', 'aaaa')

<re.Match object; span=(0, 4), match='aaaa'>

In [40]:
# Quantifiers mix with sets.
re.findall(r'[0-9]+', "1a22c333")

['1', '22', '333']

In [41]:
# Note it doesn't have to be the same element of the set repeated.
re.findall(r'[0-9]+', "1a12c123")

['1', '12', '123']

<br>

## Groups

***

In [42]:
# Match ab three to five times.
re.fullmatch('(ab){3,5}', 'ababab')

<re.Match object; span=(0, 6), match='ababab'>

In [43]:
# Match ab three to five times.
re.fullmatch('(ab){3,5}', 'abababab')

<re.Match object; span=(0, 8), match='abababab'>

In [44]:
# Match ab three to five times.
re.fullmatch('(ab){3,5}', 'ababababab')

<re.Match object; span=(0, 10), match='ababababab'>

In [45]:
# Match ab three to five times.
re.fullmatch('(ab){3,5}', 'abababababab')

In [46]:
# Groups are capturing by default; (?:...) makes a group non-capturing.
re.fullmatch('(?:ab){3,5}', 'ababababab')

<re.Match object; span=(0, 10), match='ababababab'>

In [47]:
# The difference between capturing and non-capturing groups matters when you need the group's text later,
# for example when backreferencing or substituting.
# The first capture is \1, the second \2, etc. Note the use of r'' for raw strings is essential.
re.fullmatch(r'([0-9])\1', '99')

<re.Match object; span=(0, 2), match='99'>

In [48]:
# The backreference \1 must repeat the digit captured by group 1, so '19' does not match.
re.fullmatch(r'([0-9])\1', '19')

In [49]:
# If we don't capture a group and then try to refer to it, the pattern is invalid and re.error is raised.
try:
    re.fullmatch(r'(?:[0-9])\1', '99')
except re.error:
    print("A problem")

A problem
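
Captured groups can also be read back from the match object, and they change what `findall` returns. The following is a minimal sketch; the date string is a made-up sample, not taken from the notebook.

```python
import re

text = '2024-01-15'

# Three capturing groups: year, month and day.
m = re.fullmatch(r'([0-9]{4})-([0-9]{2})-([0-9]{2})', text)
print(m.group(0))  # the whole match
print(m.group(1))  # the first captured group
print(m.groups())  # all captured groups as a tuple

# With a capturing group, findall returns the group's text (its last repetition).
captured = re.findall(r'(ab)+', 'ababab')
# With a non-capturing group, findall returns the whole match.
uncaptured = re.findall(r'(?:ab)+', 'ababab')
print(captured, uncaptured)
```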


<br>

## Anchors

***

Anchors are special characters and sequences that don't match any character themselves.

Rather they indicate a position.
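
One way to see that an anchor is a position rather than a character is to mark every place it matches. This is only a sketch: the sample text and the '|' marker are arbitrary choices.

```python
import re

# \b matches the empty position between a word character and a non-word character.
marked = re.sub(r'\b', '|', 'an island')
print(marked)

# An anchor match has zero width: its span starts and ends at the same index.
m = re.search(r'\b', 'an island')
print(m.span())
```

The marker is inserted at the four word boundaries and no original characters are replaced.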

***

In [50]:
# \b matches a word boundary.
re.findall(r'\band\b', "Ireland is and island and England is part of an island.")

['and', 'and']

In [51]:
# ^ matches at the start of the string (and at the start of each line when re.MULTILINE is set).
re.findall(r'^and', "and will be found but and won't.")

['and']

In [52]:
# $ matches at the end of the string (and at the end of each line when re.MULTILINE is set).
re.findall(r'and$', "This and won't be found but let's end the sentence in and and that will be found: and")

['and']
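
Whether ^ and $ work per string or per line depends on the `re.MULTILINE` flag. A short sketch with a made-up two-line string:

```python
import re

text = 'and one\nand two'

# By default, ^ only matches at the very start of the string.
default = re.findall(r'^and', text)
# With re.MULTILINE, ^ also matches immediately after each newline.
multiline = re.findall(r'^and', text, flags=re.MULTILINE)

print(default, multiline)
```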

<br>

## Greedy

***

Quantifiers are generally greedy: they'll match as much as they can.

You can change that behaviour with a question mark.

Note that question marks have a few different meanings, depending on their contexts.
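
The three meanings of the question mark used in this notebook can be sketched side by side. The sample strings are illustrative only.

```python
import re

# 1. After an item, ? is a quantifier meaning "zero or one".
optional = re.findall(r'colou?r', 'color colour')

# 2. After another quantifier, ? makes that quantifier lazy (non-greedy).
greedy = re.match(r'a+', 'aaa').group()
lazy = re.match(r'a+?', 'aaa').group()

# 3. Right after an opening parenthesis, ?: marks the group as non-capturing.
groups = re.fullmatch(r'(?:ab)+(c)', 'ababc').groups()

print(optional, greedy, lazy, groups)
```

The greedy form takes all three a's, the lazy form takes just one, and only the `(c)` group is captured.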

***

In [53]:
# A greedy quantifier is usually not what you want when matching HTML tags: .* runs on to the last '>'.
re.findall(r'<.*>', "<title>Hello, world</title>")

['<title>Hello, world</title>']

In [54]:
# Make the quantifier lazy with ? and it will match each tag separately.
re.findall(r'<.*?>', "<title>Hello, world</title>")

['<title>', '</title>']

In [55]:
# Note there are often lots of ways to achieve the same result - here's a greedy one.
re.findall(r'<[^>]*>', "<title>Hello, world</title>")

['<title>', '</title>']

<br>

## Working with Data

***

Regular expressions are especially useful in manipulating large structured text files like CSV files.

Here we load the iris data set.

***

In [56]:
# For fetching files on the internet.
import requests

# Get the iris data set.
response = requests.get("https://raw.githubusercontent.com/ianmcloughlin/datasets/main/iris.csv")

In [57]:
# Print the first few lines.
list(response.iter_lines())[:5]

[b'sepal_length,sepal_width,petal_length,petal_width,species',
 b'5.1,3.5,1.4,0.2,setosa',
 b'4.9,3.0,1.4,0.2,setosa',
 b'4.7,3.2,1.3,0.2,setosa',
 b'4.6,3.1,1.5,0.2,setosa']

In [58]:
# Store all lines except for the header.
lines = list(response.iter_lines())[1:]

# Decode the lines.
lines = [line.decode('utf-8') for line in lines]

# Display every twentieth line.
lines[::20]

['5.1,3.5,1.4,0.2,setosa',
 '5.4,3.4,1.7,0.2,setosa',
 '5.0,3.5,1.3,0.3,setosa',
 '5.0,2.0,3.5,1.0,versicolor',
 '5.5,2.4,3.8,1.1,versicolor',
 '6.3,3.3,6.0,2.5,virginica',
 '6.9,3.2,5.7,2.3,virginica',
 '6.7,3.1,5.6,2.4,virginica']

In [59]:
# Match decimal numbers - the full stop is escaped so it only matches a literal '.'.
re.findall(r'[0-9]+\.[0-9]+', lines[0])

['5.1', '3.5', '1.4', '0.2']

In [60]:
# Identify each number on each line.
for line in lines[::20]:
    print(re.sub(r'([0-9]+\.[0-9]+)', r'NUMBER:\1', line))

NUMBER:5.1,NUMBER:3.5,NUMBER:1.4,NUMBER:0.2,setosa
NUMBER:5.4,NUMBER:3.4,NUMBER:1.7,NUMBER:0.2,setosa
NUMBER:5.0,NUMBER:3.5,NUMBER:1.3,NUMBER:0.3,setosa
NUMBER:5.0,NUMBER:2.0,NUMBER:3.5,NUMBER:1.0,versicolor
NUMBER:5.5,NUMBER:2.4,NUMBER:3.8,NUMBER:1.1,versicolor
NUMBER:6.3,NUMBER:3.3,NUMBER:6.0,NUMBER:2.5,virginica
NUMBER:6.9,NUMBER:3.2,NUMBER:5.7,NUMBER:2.3,virginica
NUMBER:6.7,NUMBER:3.1,NUMBER:5.6,NUMBER:2.4,virginica


In [61]:
# Identify each number and then the string.
for line in lines[::20]:
    print(re.sub(r'([0-9]+\.[0-9]+),([0-9]+\.[0-9]+),([0-9]+\.[0-9]+),([0-9]+\.[0-9]+),(.*)', r'NUMBER1:\1,NUMBER2:\2,NUMBER3:\3,NUMBER4:\4,STRING:\5', line))

NUMBER1:5.1,NUMBER2:3.5,NUMBER3:1.4,NUMBER4:0.2,STRING:setosa
NUMBER1:5.4,NUMBER2:3.4,NUMBER3:1.7,NUMBER4:0.2,STRING:setosa
NUMBER1:5.0,NUMBER2:3.5,NUMBER3:1.3,NUMBER4:0.3,STRING:setosa
NUMBER1:5.0,NUMBER2:2.0,NUMBER3:3.5,NUMBER4:1.0,STRING:versicolor
NUMBER1:5.5,NUMBER2:2.4,NUMBER3:3.8,NUMBER4:1.1,STRING:versicolor
NUMBER1:6.3,NUMBER2:3.3,NUMBER3:6.0,NUMBER4:2.5,STRING:virginica
NUMBER1:6.9,NUMBER2:3.2,NUMBER3:5.7,NUMBER4:2.3,STRING:virginica
NUMBER1:6.7,NUMBER2:3.1,NUMBER3:5.6,NUMBER4:2.4,STRING:virginica


In [62]:
# Move the string to the start of the numbers.
for line in lines[::20]:
    print(re.sub(r'([0-9]+\.[0-9]+),([0-9]+\.[0-9]+),([0-9]+\.[0-9]+),([0-9]+\.[0-9]+),(.*)', r'\5,\1,\2,\3,\4', line))

setosa,5.1,3.5,1.4,0.2
setosa,5.4,3.4,1.7,0.2
setosa,5.0,3.5,1.3,0.3
versicolor,5.0,2.0,3.5,1.0
versicolor,5.5,2.4,3.8,1.1
virginica,6.3,3.3,6.0,2.5
virginica,6.9,3.2,5.7,2.3
virginica,6.7,3.1,5.6,2.4


In [63]:
# Move the string to the start of the numbers and reverse the numbers.
for line in lines[::20]:
    print(re.sub(r'([0-9]+\.[0-9]+),([0-9]+\.[0-9]+),([0-9]+\.[0-9]+),([0-9]+\.[0-9]+),(.*)', r'\5,\4,\3,\2,\1', line))

setosa,0.2,1.4,3.5,5.1
setosa,0.2,1.7,3.4,5.4
setosa,0.3,1.3,3.5,5.0
versicolor,1.0,3.5,2.0,5.0
versicolor,1.1,3.8,2.4,5.5
virginica,2.5,6.0,3.3,6.3
virginica,2.3,5.7,3.2,6.9
virginica,2.4,5.6,3.1,6.7
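
The long patterns above can be made easier to read with a compiled pattern and named groups. This is one possible sketch, not the only way to do it: `NUMBER` and `ROW` are hypothetical helper names, the group names are taken from the CSV header, and the sample line is copied from the output above so that no download is needed.

```python
import re

# Hypothetical helpers: one sub-pattern for a decimal number, reused four times.
NUMBER = r'[0-9]+\.[0-9]+'
ROW = re.compile(
    rf'(?P<sepal_length>{NUMBER}),'
    rf'(?P<sepal_width>{NUMBER}),'
    rf'(?P<petal_length>{NUMBER}),'
    rf'(?P<petal_width>{NUMBER}),'
    r'(?P<species>.*)'
)

line = '5.1,3.5,1.4,0.2,setosa'

# groupdict maps each group name to the text it captured.
row = ROW.fullmatch(line).groupdict()
print(row)

# In a replacement string, named groups are referenced with \g<name>.
moved = ROW.sub(
    r'\g<species>,\g<sepal_length>,\g<sepal_width>,\g<petal_length>,\g<petal_width>',
    line,
)
print(moved)
```

Naming the groups means the replacement no longer depends on counting parentheses, which helps when columns are added or reordered.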


***

## End