# Lecture 6.1 : Regular expressions

## Introduction

- Imagine we have downloaded the CA117 classlist from the web as an HTML
  document.  
- We would like to e-mail all students in the class and included in the
  HTML document is every student’s email address. So far so good.  
- Unfortunately, the document is “noisy”: 95% of it is HTML mark-up that
  obscures the e-mail addresses scattered throughout the document.  
- What can we do? We could manually scroll through the document looking
  for e-mail addresses and copy them to a list. That would be tedious
  and error-prone, however. Is there an easier way?  
- It would be great if we could specify a “pattern” or “template” that
  matched and extracted just the information in the document that
  is of interest to us, i.e. e-mail addresses.  
- If we could specify a general pattern that every e-mail address follows
  and then extract everything in the document that matches that pattern
  then we would have a list of just the required e-mail addresses.  
- Regular expressions allow us to do just that!  

## Regular expressions

- Regular expressions are used to specify patterns for entities we wish
  to locate and match in a larger string.  
- Examples might be dates, times, e-mail addresses, names, credit card
  numbers, social security numbers, directory paths, file names, etc.  
- Once we have defined a suitable regular expression we can ask questions
  such as the following: Is there a match for this pattern anywhere in the
  given string?  
- Regular expressions also allow us to efficiently find all substrings of
  a larger string that match the specified pattern.  
- This would seem ideal for our task: If we treat the HTML document as a
  single large string our task is to extract every substring from it that
  matches the pattern of an e-mail address.  
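
As a preview of where this is heading, here is a minimal sketch of the extraction task. The HTML fragment, the student names and the addresses are hypothetical stand-ins for the real classlist, and the e-mail pattern is a deliberately simplified assumption rather than a complete description of valid addresses. The individual pieces of the pattern are explained in the sections that follow.

```python
from re import findall

# Hypothetical fragment standing in for the downloaded classlist
html = (
    '<tr><td>Ada Byrne</td><td><a href="mailto:ada.byrne@example.com">'
    'ada.byrne@example.com</a></td></tr>\n'
    '<tr><td>Sam Ryan</td><td>sam.ryan2@example.com</td></tr>'
)

# Simplified e-mail pattern (an assumption, not a full validator):
# name characters, an @, domain characters, a dot, then lower-case letters
p = r'[\w.]+@[\w.]+\.[a-z]+'

emails = findall(p, html)

# The same address can appear more than once in the mark-up,
# so remove duplicates while keeping first-seen order
unique = list(dict.fromkeys(emails))
print(unique)   # ['ada.byrne@example.com', 'sam.ryan2@example.com']
```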

## Defining patterns

- The simplest of patterns takes the form of an ordinary string.  
- Below we define a regular expression `r'cat'` to match occurrences of
  the pattern ‘cat’ and we call this regular expression `p` (for
  pattern).  
- When defining a pattern we *always* precede it with `r` in order to
  indicate to Python that this is a *raw string* (this prevents
  Python imposing its own interpretation on any backslash sequences,
  such as `\n`, that might arise in the pattern).  
- We match this pattern against the string `s` by calling
  `findall()`. The latter function returns a list of all substrings
  of `s` that match the defined pattern.  
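- Two matches, as we might expect, are returned: the ‘cat’ inside
  ‘catatonic’ and the word ‘cat’ itself. ‘Catastrophe’ is not matched
  because matching is case-sensitive.  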

In [3]:
# TODO
# We want to find all matches for a given pattern so import the required function
# import re
from re import findall

# We will look for matches in here
s = 'A catatonic cat sat on the mat. Catastrophe!'

# TODO
# Define our pattern in a raw string
p = r'cat'

# TODO
# Look for matches and print the result
print(findall(p, s))

['cat', 'cat']


In [7]:
p = '\n'
p1 = r'\n'
print(p)
print(p1)



\n
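
The difference matters most for sequences that mean one thing to Python and another to the regular expression engine. The sketch below uses `\b`, which Python reads as a backspace character in an ordinary string but which, in a raw string, reaches the regular expression engine as the word-boundary sequence listed in the table of special sequences later in this lecture. The variable names are illustrative only.

```python
from re import findall

s = 'A catatonic cat sat on the mat.'

# In an ordinary string Python turns \b into one backspace character;
# in a raw string the backslash and the b are kept as two characters
print(len('\b'), len(r'\b'))       # 1 2

plain = findall('\bcat\b', s)      # looks for backspace + 'cat' + backspace
raw = findall(r'\bcat\b', s)       # looks for 'cat' as a whole word

print(plain)    # []
print(raw)      # ['cat']
```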




## Character classes

- We can define *character classes* to be matched against.  
- The character class `[abc]` will match any *single* character a,
  b or c.  
- The character class `[a-z]` will match any *single* lower-case
  character a through z.  
- The class `[a-zA-Z0-9]` will match any *single* alphanumeric character.  
- Let’s use a character class to match instances of both ‘cat’ and ‘Cat’.  

In [8]:
from re import findall
# We will look for matches in here
s = 'A catatonic cat sat on the mat. Catastrophe!'

# TODO
# Match one of 'C' or 'c' followed by 'at'
p = r'[Cc]at'

# Match and print the result
print(findall(p, s))

['cat', 'cat', 'Cat']
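
The range forms listed above can be tried in the same way. This is a small sketch on a made-up string; each class still matches one character at a time, which is why the results are lists of single characters.

```python
from re import findall

s = 'CA117 lab 6'

upper = findall(r'[A-Z]', s)         # any single upper-case letter
digits = findall(r'[0-9]', s)        # any single digit
alnum = findall(r'[a-zA-Z0-9]', s)   # any single letter or digit

print(upper)    # ['C', 'A']
print(digits)   # ['1', '1', '7', '6']
print(alnum)    # ['C', 'A', '1', '1', '7', 'l', 'a', 'b', '6']
```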




## Character class negation

- We can negate a character class by placing the `^` symbol immediately
  after its opening bracket: `[^abc]` matches any *single* character
  other than a, b or c.  

In [9]:
# We will look for matches in here
s = 'A catatonic cat sat on the mat. Catastrophe!'

# TODO
# Match anything except 'C' or 'c' followed by 'at'
p = r'[^Cc]at'

# Match and print the result
print(findall(p, s))

['tat', 'sat', 'mat']
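
The position of `^` inside the brackets matters. In the sketch below (example strings are made up) the first class is negated because `^` comes first, while in the second class `^` is just another character to match.

```python
from re import findall

# ^ first in the class: match anything that is NOT a digit
not_digits = findall(r'[^0-9]', 'CA117')

# ^ elsewhere in the class: match a digit or a literal ^
digit_or_caret = findall(r'[0-9^]', '2^10')

print(not_digits)       # ['C', 'A']
print(digit_or_caret)   # ['2', '^', '1', '0']
```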




## Sequences

- In addition to defining our own character classes we can call upon a
  predefined set of character classes when constructing regular expressions.  
- Such predefined classes are accessed using *special sequences*.  
    
  |Sequence|Matches|
  |:------:|:--------------------------------------------------------------------------:|
  |\d|Matches any decimal digit|
  |\D|Matches any non-digit|
  |\s|Matches any whitespace character (e.g. space, tab, newline)|
  |\S|Matches any non-whitespace character|
  |\w|Matches any alphanumeric character or underscore|
  |\W|Matches any character that is not alphanumeric or an underscore|
  |\b|Matches any word boundary (a word is an alphanumeric sequence of characters)|



In [10]:
# TODO
# Match any single digit
p = r'\d'

print(findall(p, '1 and 2 and 34'))
print(findall(p, 'No digits here'))

['1', '2', '3', '4']
[]


In [11]:
# TODO
# Match any single non-digit
p = r'\D'

print(findall(p, '1 and 2 and 34'))
print(findall(p, 'No digits here'))

[' ', 'a', 'n', 'd', ' ', ' ', 'a', 'n', 'd', ' ']
['N', 'o', ' ', 'd', 'i', 'g', 'i', 't', 's', ' ', 'h', 'e', 'r', 'e']


In [12]:
# TODO
# Match any single whitespace character
p = r'\s'

print(findall(p, '1 and 2 and 34'))
print(findall(p, 'No digits here'))
print(findall(p, '1\n2\t3'))

[' ', ' ', ' ', ' ']
[' ', ' ']
['\n', '\t']


In [13]:
# TODO
# Match any single non-whitespace character
p = r'\S'

print(findall(p, '1\n2\t3'))

['1', '2', '3']


In [14]:
# TODO
# Match any single alphanumeric character (digits and letters)
p = r'\w'

print(findall(p, '1 and 2 and 34'))
print(findall(p, '1\n2\t3'))

['1', 'a', 'n', 'd', '2', 'a', 'n', 'd', '3', '4']
['1', '2', '3']


In [15]:
# TODO
# Match any single non-alphanumeric character
p = r'\W'

print(findall(p, '1 and 2 and 34'))
print(findall(p, '1\n2\t3'))
print(findall(p, '1 < 3'))

[' ', ' ', ' ', ' ']
['\n', '\t']
[' ', '<', ' ']
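
The one sequence in the table not demonstrated above is `\b`. Unlike the others it matches a *position* rather than a character, so it adds nothing to the matched text. A small sketch, reusing the sentence from earlier sections:

```python
from re import findall

s = 'A catatonic cat sat on the mat. Catastrophe!'

# Any word character followed by 'at', anywhere in the string
anywhere = findall(r'\wat', s)

# The same, but only where a word ends immediately afterwards
word_end = findall(r'\wat\b', s)

print(anywhere)   # ['cat', 'cat', 'sat', 'mat', 'Cat']
print(word_end)   # ['cat', 'sat', 'mat']
```

The ‘cat’ of ‘catatonic’ and the ‘Cat’ of ‘Catastrophe’ drop out of the second result because each is followed by another letter, so there is no word boundary after them.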



## Metacharacters

- Most characters simply match themselves.  
- Exceptions are metacharacters.  
- Metacharacters are special characters that do not match themselves but
  signal that something else should be matched.  
- Here are three common examples:  
    
  |Metacharacter|Matches|
  |:-----------:|:---------------------------------------:|
  |^|Matches the beginning of a string|
  |\$|Matches the end of a string|
  |.|Matches any character (except a new line)|

In [19]:
s = 'One rule of grammar says avoid ending a sentence with of'

# TODO
# Look for of's that break this rule
p = r'of$'

print(findall(p, s))

['of']
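
The other two metacharacters in the table behave as follows. This is a sketch on made-up strings; the last line also shows that preceding a metacharacter with a backslash makes it match itself.

```python
from re import findall

s = 'One rule of grammar says avoid ending a sentence with of'

starts_with_one = findall(r'^One', s)   # 'One' is at the very start
starts_with_of = findall(r'^of', s)     # 'of' occurs, but not at the start

# . matches any single character except a newline
c_any_t = findall(r'c.t', 'cat cot c\nt cut')

# An escaped metacharacter matches the literal character
literal_dollar = findall(r'\$', 'Lunch cost $5')

print(starts_with_one)   # ['One']
print(starts_with_of)    # []
print(c_any_t)           # ['cat', 'cot', 'cut']
print(literal_dollar)    # ['$']
```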


## A pattern that occurs once or zero times

- We can match a pattern once or zero times with the ? metacharacter.  
- Thus we can use ? to effectively make a pattern *optional*.  

In [None]:
# TODO
# Match US and IE spelling of colour
p = r'colou?r'

print(findall(p, 'In America they spell it colouur')) # ? allows zero or one 'u', so two u's do not match
print(findall(p, 'In America they spell it color'))
print(findall(p, 'Over here we spell it colour'))

[]
['color']
['colour']
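
Another common use of `?` is an optional trailing letter. The sentence below is made up for illustration; note that `?` applies only to the single character immediately before it, here the `s`.

```python
from re import findall

s = 'Use http for the old site and https for the new one'

# The s is optional, so both spellings match
schemes = findall(r'https?', s)

print(schemes)   # ['http', 'https']
```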




## Repeating a pattern a fixed number of times

- With regular expressions we can match portions of a pattern multiple
  times.  
- We do so by specifying the number of required matches inside curly
  brackets immediately after the portion to be repeated, e.g. `\d{2}`
  matches exactly two digits.  

In [30]:
# TODO
# Match a date of the form dd/mm/yy or dd-mm-yy
# p = r'[0-9][0-9][-/][0-9][0-9][-/][0-9][0-9]'
p = r'\d{2}[-/]\d{2}[-/]\d{2}'

print(findall(p, 'Christmas falls on 25-12-21'))
print(findall(p, "Valentine's Day is 14/02/21"))

['25-12-21']
['14/02/21']
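
One thing to watch for: the pattern above says nothing about what surrounds the date, so it can also match *part* of a longer run of digits. The sketch below, on a made-up string, shows the effect and one possible remedy using the `\b` sequence from the earlier table. The name `strict` is illustrative only.

```python
from re import findall

p = r'\d{2}[-/]\d{2}[-/]\d{2}'
# Same pattern, but a word boundary is required on both sides
strict = r'\b\d{2}[-/]\d{2}[-/]\d{2}\b'

s = 'Order 125-12-2021 shipped on 25-12-21'

loose_matches = findall(p, s)
strict_matches = findall(strict, s)

print(loose_matches)    # ['25-12-20', '25-12-21']
print(strict_matches)   # ['25-12-21']
```

Neither pattern checks that the numbers form a real date; `99/99/99` would match both.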




## Groups

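- If our pattern contains a *group* of characters that must be treated
  as a single unit (for example, to be matched some number of times or
  to limit the reach of the `|` “or” operator) then we need to enclose
  the group with `(?:` on the left hand side and `)` on the right hand side.  
- In the pattern below `March|April` means “either March or April” and
  `\d{1,2}` means “one or two digits”.  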

In [34]:
s = "St. Patrick's Day is March 17. April Fool's Day is April 1."

# TODO
# Let's try to match dates of the above form

p = r'(?:March|April) \d{1,2}'

print(findall(p, s))

['March 17', 'April 1']
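
To see why the `(?:` … `)` form is used, it helps to compare it with the alternatives. The sketch below runs three variants of the pattern on the same string, then shows a group being repeated. With plain parentheses `findall()` returns only the text matched inside the group, which is usually not what we want here.

```python
from re import findall

s = "St. Patrick's Day is March 17. April Fool's Day is April 1."

# No group: | splits the WHOLE pattern into 'March' or 'April \d{1,2}'
no_group = findall(r'March|April \d{1,2}', s)

# Plain (capturing) parentheses: findall returns just the group's text
capturing = findall(r'(March|April) \d{1,2}', s)

# Non-capturing group: findall returns the whole match
non_capturing = findall(r'(?:March|April) \d{1,2}', s)

# A group can also be repeated as a unit: exactly two copies of 'ab'
repeated = findall(r'(?:ab){2}', 'ababab abab')

print(no_group)        # ['March', 'April 1']
print(capturing)       # ['March', 'April']
print(non_capturing)   # ['March 17', 'April 1']
print(repeated)        # ['abab', 'abab']
```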




## Repeating a pattern at least M and at most N times

- If we need to match a pattern at *least* a number of times *M* and
  at *most* a number of times *N* then we write `{M,N}` (with no space
  after the comma).  

In [36]:
# TODO
# The more e's in your hey the happier you are
p = r'He{1,4}y!'

print(findall(p, 'Hey!'))
print(findall(p, 'Heey!'))
print(findall(p, 'Heeeey!'))
print(findall(p, 'Heeeeeeey!')) # Too happy to match

['Hey!']
['Heey!']
['Heeeey!']
[]
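
Two related details are sketched below on a made-up string. Leaving out the upper bound, as in `{2,}`, means “at least two times”. Putting a space inside the braces stops them being treated as a repeat count at all, so the pattern quietly fails to match.

```python
from re import findall

s = 'Hey! Heey! Heeeeeeey!'

# Open-ended upper bound: two or more e's
at_least_two = findall(r'He{2,}y!', s)

# A space after the comma: the braces are now matched literally
with_space = findall(r'He{1, 4}y!', s)

print(at_least_two)   # ['Heey!', 'Heeeeeeey!']
print(with_space)     # []
```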




## Repeating a pattern zero or more times

- Some *metacharacters* allow us to handle an *arbitrary* number of
  matches.  
- One such metacharacter for specifying a repeated pattern is `*`.  
- The `*` metacharacter signifies that the preceding pattern can be matched
  zero or more times.  

In [39]:
# TODO
# Match zero or more o's in the final Do
p = r'Yabba Dabba Do*!'

print(findall(p, 'Yabba Dabba Do!'))
print(findall(p, 'Yabba Dabba Doo!'))
print(findall(p, 'Yabba Dabba Doooooooooo!'))
print(findall(p, 'Yabba Dabba D!')) # We probably shouldn't match this

['Yabba Dabba Do!']
['Yabba Dabba Doo!']
['Yabba Dabba Doooooooooo!']
['Yabba Dabba D!']
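
The `*` metacharacter can follow a special sequence as well as an ordinary character. A small sketch on made-up strings, where `\s*` allows for any amount of whitespace, including none:

```python
from re import findall

# 'a' followed by zero or more b's
a_then_bs = findall(r'ab*', 'a ab abbb b')

# A digit, optional whitespace, then 'kg'
weights = findall(r'\d\s*kg', '5kg, 7 kg, 9   kg')

print(a_then_bs)   # ['a', 'ab', 'abbb']
print(weights)     # ['5kg', '7 kg', '9   kg']
```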




## Repeating a pattern one or more times

- Another metacharacter for specifying a repeated pattern is `+`.  
- It signifies that the preceding pattern can be matched an arbitrary number
  of times but *must be matched at least once*.  
- Note the difference between `*` and `+`: with `*` the specified pattern may
  not be present at all while with `+` the specified pattern must be present
  at least once.  

In [41]:
# TODO
# Match one or more o's in the final Do
p = r'Yabba Dabba Do+!'

print(findall(p, 'Yabba Dabba Do!'))
print(findall(p, 'Yabba Dabba Doo!'))
print(findall(p, 'Yabba Dabba Doooooooooo!'))
print(findall(p, 'Yabba Dabba D!')) # We no longer match this

['Yabba Dabba Do!']
['Yabba Dabba Doo!']
['Yabba Dabba Doooooooooo!']
[]
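
Combining `+` with a character class or special sequence is how we move from matching single characters to matching whole numbers or words. A sketch on a made-up sentence:

```python
from re import findall

s = 'CA117 has 2 labs and 12 lectures'

# One or more digits in a row: whole numbers rather than single digits
numbers = findall(r'\d+', s)

# One or more letters in a row
words = findall(r'[A-Za-z]+', s)

print(numbers)   # ['117', '2', '12']
print(words)     # ['CA', 'has', 'labs', 'and', 'lectures']
```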


## Exercise

- Extract the longest contiguous sequence of upper-case letters from a string

In [46]:
s = 'ABcdefQWERTYUIOPxyzZASDF'

# TODO
# Match a contiguous sequence of capital letters
p = r'[A-Z]+'

matches = findall(p, s)

print(max(matches, key=len))

QWERTYUIOP
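
The solution above assumes the string contains at least one upper-case letter; with no matches `max()` is handed an empty list and raises a `ValueError`. One way to guard against that, sketched here as a hypothetical helper function (`longest_upper_run` is not part of the original exercise), is the `default` argument of `max()`.

```python
from re import findall

def longest_upper_run(s):
    """Return the longest run of upper-case letters in s, or '' if there is none."""
    matches = findall(r'[A-Z]+', s)
    # default='' is returned when matches is empty
    return max(matches, key=len, default='')

print(longest_upper_run('ABcdefQWERTYUIOPxyzZASDF'))   # QWERTYUIOP
print(repr(longest_upper_run('no capitals here')))     # ''
```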


## Exercise

- Map aaabbbcddd to 3a3b1c3d

In [56]:
# TODO
# Match one or more instances of any lower case letter
from string import ascii_lowercase

letters = list(ascii_lowercase)
# print(letters)

p = '+|'.join(letters) + '+'
# print(p)

s = 'aaabbbcddd'
matches = findall(p, s)
# print(matches)

# Use a separate loop variable so that s is not shadowed
print("".join([f"{len(m)}{m[0]}" for m in matches]))

3a3b1c3d
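
It is worth looking at the pattern this builds and at why the more obvious `[a-z]+` would not do. The sketch below repackages the same approach as a function; the name `run_length_encode` is hypothetical and chosen only for illustration.

```python
from re import findall
from string import ascii_lowercase

# Same construction as above: one alternative per letter
p = '+|'.join(ascii_lowercase) + '+'
print(p[:8], '...', p[-5:])   # a+|b+|c+ ... y+|z+

def run_length_encode(s):
    """Map e.g. 'aaabbbcddd' to '3a3b1c3d' (lower-case letters only)."""
    runs = findall(p, s)
    return ''.join(f'{len(run)}{run[0]}' for run in runs)

# [a-z]+ treats the whole string as ONE run of mixed letters,
# whereas each alternative in p only repeats a single letter
mixed = findall(r'[a-z]+', 'aaabbbcddd')
same_letter = findall(p, 'aaabbbcddd')

print(mixed)         # ['aaabbbcddd']
print(same_letter)   # ['aaa', 'bbb', 'c', 'ddd']
print(run_length_encode('aaabbbcddd'))   # 3a3b1c3d
```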
