<img src="https://nyp-aicourse.s3.ap-southeast-1.amazonaws.com/agods/nyp_ago_logo.png" width='400'/>

# Regular Expressions in Python

A regular expression is a tool for matching text by describing a pattern. When we know the exact text, we can search for it directly. However, there are times when we do not know the exact text, for example, when checking whether a user's phone number or email address has the proper format.

A regular expression can identify the presence or absence of text matching the pattern, and can also split a match into one or more subpatterns, delivering the specific text within each.

Regular expressions are also known as regexes, and in Python they are supported by the `re` module.

## Common reasons for using regular expressions

1. Data mining - extracting emails, URLs, phone numbers, etc. from a large amount of text.

2. Validation - determining whether data, especially data received from external sources, is valid or not.

To begin, let's import the module needed for regular expressions.

In [1]:
import re

## Regular expression operations

For a list of the operations available in the `re` module, check out

https://docs.python.org/3/library/re.html#functions

We will use `re.search()`, which takes a regular expression and the source text and looks for a match. If there is no match, `re.search()` returns `None`.

The `r` preceding the string stands for "raw". In a raw string literal, Python does not interpret the `\` character as an escape character, so it reaches the regular expression engine unchanged.


```python
source_text = "Today is a great day to be learning about the great regular expressions"
re.search(r'great', source_text)
```

Try it in the cell below.

Observe that a `re.Match` object is returned.
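
To see why the `r` prefix matters, the sketch below compares a plain string and a raw string that both contain `\b` (the word-boundary shortcut covered later). The sample text and variable names are only illustrative.

```python
import re

# In a plain string literal, "\b" is the backspace character,
# so this pattern is 5 characters long and contains no backslash.
plain_pattern = "\bcat\b"

# In a raw string literal the backslashes are kept, so the regex
# engine receives \b, its word-boundary token (7 characters).
raw_pattern = r"\bcat\b"

text = "a cat sat"
plain_result = re.search(plain_pattern, text)  # None: the text contains no backspace
raw_result = re.search(raw_pattern, text)      # matches 'cat'

print(len(plain_pattern), len(raw_pattern))
print(plain_result)
print(raw_result)
```

Using raw strings for every pattern is a simple habit that avoids this kind of surprise.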

Note that there is also a `match()` function, which only looks for a match at the beginning of the string, not anywhere in the string.

In [2]:
re.match("c", "abcdef")    # No match

In [3]:
re.search("c", "abcdef")   # Match

<re.Match object; span=(2, 3), match='c'>

The Match object is important as it provides methods to inform us about the match.

The `group()` method returns a string with the text of the match while `groups()` and `groupdict()` can be used to call out subsections of the regular expression.
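
As a small sketch of how the Match object can be inspected, the example below reuses the earlier source text. Besides `group()`, it uses `start()`, `end()` and `span()`, which are standard Match methods not mentioned above.

```python
import re

source_text = "Today is a great day to be learning about the great regular expressions"
match = re.search(r"great", source_text)

# search() returns None when nothing matches, so check before using the result.
if match is not None:
    matched_text = match.group()   # the text that matched
    start = match.start()          # index of the first matched character
    end = match.end()              # index just past the last matched character
    print(matched_text, start, end, match.span())
```

Slicing the source text with `source_text[start:end]` gives back the same text as `match.group()`.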

## findall and finditer

The two functions return all non-overlapping matches, including empty matches. `findall()` returns a list, while `finditer()` returns an iterator.

`findall()` returns the matched text as strings rather than `re.Match` objects, whereas `finditer()` yields a `re.Match` object for each match.

Both differ from `search()`, which only returns the first match.

In [4]:
re.findall(r'a', 'zabcedfgabcdefg apple')

['a', 'a', 'a']

In [5]:
found = re.finditer(r'a', 'zabcedfgabcdefg apple')

In [6]:
for i in found:
  print(i)

<re.Match object; span=(1, 2), match='a'>
<re.Match object; span=(8, 9), match='a'>
<re.Match object; span=(16, 17), match='a'>


## Some basic examples

Regular expressions are case-sensitive by default.

In [7]:
re.search(r"C", "abcdef")    # No match

In [8]:
re.search(r"Simple", "simple") # No match

In [9]:
re.search(r"Simple", "This is a Simple tale") # match word in a larger block of text

<re.Match object; span=(10, 16), match='Simple'>

## Character Classes

A character class specifies that a single position should match any one of a set of possible characters, rather than just one specific character.

Denote the character class by using square brackets and listing the possible characters within the brackets.

In [10]:
re.search(r"[Mm]erry", "Berry Merry")

<re.Match object; span=(6, 11), match='Merry'>

In [11]:
re.search(r"[Mm]erry", "berry merry")

<re.Match object; span=(6, 11), match='merry'>

In [12]:
re.search(r"[Mm]erry", "berry MERRY")

The character class did not make the entire word case-insensitive, but it can be useful for words that have alternative spellings.

In [13]:
re.search(r"gr[ae]y", "gray color")

<re.Match object; span=(0, 4), match='gray'>

In [14]:
re.search(r"gr[ae]y", "graey color")

A character class matches exactly one character. As a result, the previous regular expression has no match: after `[ae]` matches 'a', the pattern expects 'y', but the next character in the source text is 'e'.

## Ranges using hyphen (-)

With the large number of digits and letters, it is more convenient to use a hyphen to denote ranges of digits and letters.

|Ranges| Match|
|---|---|
|`[0-9]` |   Match any digit|
|`[a-z]` |   Match any lowercase letter|
|`[A-Z]` |   Match any uppercase letter|
|`[A-Za-z]`|   Match any lowercase or uppercase letter|
|`[A-Za-z0-9_-]`|   Match any lowercase letter, uppercase letter, digit, underscore or hyphen|

In [15]:
re.search(r"[0-9_-]", "gray-color")

<re.Match object; span=(4, 5), match='-'>

## Negation using ^ (circumflex, also called the hat operator)

Placing `^` as the first character inside the brackets negates the class, so it matches any single character that is not listed.

|Ranges| Match|
|---|---|
|`[^0-9]` |   Match any non-digit|
|`[^a-z]` |   Match any non-lowercase letter|

In [16]:
re.search(r"[^0-9]", "gray-color")

<re.Match object; span=(0, 1), match='g'>

In [17]:
re.search(r"[^a-z]", "gray-color")

<re.Match object; span=(4, 5), match='-'>

When using negation, be aware that the negated class matches every other character, including numbers, letters, symbols and whitespace. It still has to match a character, so it cannot match at the end of the string.

For example, the regular expression `e[^d]` finds the character 'e' followed by any character that is not a 'd'.

In [18]:
re.search(r"e[^d]", "padded") #no match

In [19]:
re.search(r"e[^d]", "made") #no match

In [20]:
re.search(r"e[^d]", "pears")

<re.Match object; span=(1, 3), match='ea'>

In [21]:
re.search(r"e[^d]", "fade@gmail.com")

<re.Match object; span=(3, 5), match='e@'>

In [22]:
re.search(r"e[^d]", "be f g h")

<re.Match object; span=(1, 3), match='e '>

## Shortcuts

A shortcut (`\w`) can be used to match "any word character". In Python 3, it matches letters and digits from any language, as well as the underscore (`_`). It does not match the hyphen (`-`).

A shortcut (`\d`) can be used to match "any digit character". In Python 3, it also matches digit characters from other languages.

A shortcut (`\s`) is used to match whitespace characters such as space, tab, and newline.

A shortcut (`\b`) is used to match a zero-length substring at the beginning or end of a word. It is a word boundary character shortcut.

In [23]:
re.findall(r'\w',"It is 3 o'clock.")

['I', 't', 'i', 's', '3', 'o', 'c', 'l', 'o', 'c', 'k']

In [24]:
re.findall(r'\d',"It is 3 o'clock.")

['3']

In [25]:
re.findall(r'\s',"It is 3 o'clock.")

[' ', ' ', ' ']

In [26]:
re.findall(r'\bi',"i am late. it is 3 o'clock.")

['i', 'i', 'i']

In [27]:
re.findall(r'\bi\b',"i am late. it is 3 o'clock.")

['i']
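
The behaviour of `\w` and `\d` with characters from other languages can be checked with a short sketch. The sample text is made up for illustration, and the `re.ASCII` flag (not covered above) is used only to show the contrast with ASCII-only matching.

```python
import re

# '\u00e9' is the letter e with an acute accent;
# '\u0663' is ARABIC-INDIC DIGIT THREE.
text = "caf\u00e9 3 \u0663 well-known snake_case"

# By default, str patterns are Unicode-aware.
unicode_digits = re.findall(r"\d", text)

# re.ASCII restricts \d to the characters 0-9.
ascii_digits = re.findall(r"\d", text, re.ASCII)

# \w+ keeps accented letters and underscores together,
# but the hyphen splits 'well-known' into two words.
words = re.findall(r"\w+", text)

print(unicode_digits)
print(ascii_digits)
print(words)
```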

## Shortcuts negation

|Negated Shortcut| Match|
|---|---|
|`\W` |   Match any character other than word character|
|`\D` |   Match any character other than digit character|
|`\S` |   Match any character other than whitespace character|
|`\B`|   Match a zero-length substring that is not at the beginning or end of a word.|

In [28]:
re.findall(r'\W',"It is 3 o'clock.")

[' ', ' ', ' ', "'", '.']

In [29]:
re.findall(r'\D',"It is 3 o'clock.")

['I', 't', ' ', 'i', 's', ' ', ' ', 'o', "'", 'c', 'l', 'o', 'c', 'k', '.']

In [30]:
re.findall(r'\S',"It is 3 o'clock.")

['I', 't', 'i', 's', '3', 'o', "'", 'c', 'l', 'o', 'c', 'k', '.']

In [31]:
re.findall(r'\Bi',"I am counting pi on the bins. It is 3 o'clock.")

['i', 'i', 'i']

"I am count**i**ng p**i** on the b**i**ns. It is 3 o'clock."

In [32]:
re.findall(r'\Bi\B',"I am counting pi on the bins. It is 3 o'clock.")

['i', 'i']

"I am count**i**ng pi on the b**i**ns. It is 3 o'clock."

## Beginning and End of String

The `^` character is also used to match the beginning of a string.

The `$` character is used to match the end of a string.

In [33]:
re.search(r"^Practice", "The Never-ending Practice") #no match

In [34]:
re.search(r"^Practice", "Practice, Never-ending")

<re.Match object; span=(0, 8), match='Practice'>

In [35]:
re.search(r"Practice$", "The Never-ending Practice")

<re.Match object; span=(17, 25), match='Practice'>

In [36]:
re.search(r"Practice$", "Practice, Never-ending") #no match

In [37]:
re.search(r"Practice$", "The Never-ending Practice ") # no match: the string ends with a space
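
One subtlety worth knowing is that `$` also matches just before a newline at the very end of the string. The sketch below contrasts it with `\Z`, a standard `re` token not covered in this section, which only matches at the absolute end. The variable names are illustrative.

```python
import re

text = "The Never-ending Practice\n"

# $ matches at the end of the string or just before a final newline.
dollar_result = re.search(r"Practice$", text)

# \Z matches only at the very end, so the trailing newline blocks the match.
strict_result = re.search(r"Practice\Z", text)

print(dollar_result)
print(strict_result)
```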

## Any character using (.)

The dot is used outside of a bracketed character class, where it matches any single character except the newline (`\n`).

In [38]:
re.search(r".ever", "The Never-ending Practice")

<re.Match object; span=(4, 9), match='Never'>

In [39]:
re.findall(r"..e", "The Never-ending Practice")

['The', ' Ne', 'r-e', 'ice']

## Optional Characters (?)

Use the optional character (?) when you expect the character, character class or unit of reference to occur once or zero times.

In [40]:
re.search(r"colou?r", "The colour")

<re.Match object; span=(4, 10), match='colour'>

In [41]:
re.search(r"colou?r", "The color")

<re.Match object; span=(4, 9), match='color'>

## Repetition

It is common to have repeating characters or character classes. A token can be repeated by using `{N}`, where N is the number of times the token should repeat.

In [42]:
re.search(r'[\d]{4}-[\d]{4}', '6655-2211 / Office Number')

<re.Match object; span=(0, 9), match='6655-2211'>

When the number of times to repeat the token is unknown, a repetition range `{M,N}` (written without a space) can be used, where M is the lower bound and N the upper bound. Both bounds are inclusive.

For example, `[\d]{2,3}` will select runs of two or three digits.

In [43]:
re.findall(r'[\d]{2,3}', '1 42 123 6655 / Office Number')

['42', '123', '665']

If either 2 characters or 3 characters can be a valid match, the regular expression engine is "greedy" and it will match as many characters as possible. Therefore, 3 characters will be matched instead of 2.

In situations where this behaviour is not desirable, placing the optional character (?) immediately after the repetition operator would make the repetition "lazy". The engine will now match as few characters as possible to return a valid match.

In [44]:
re.findall(r'[\d]{2,3}?', '1 42 123 6655 / Office Number')

['42', '12', '66', '55']

For repetition ranges with no upper bound, leave out the upper bound value.

In [45]:
re.findall(r'[\d]{2,}', '1 42 123 6655 / Office Number')

['42', '123', '6655']
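
The difference between greedy and lazy matching becomes more visible with an open-ended range. The following sketch uses a made-up HTML-like string and combines `.` with `{1,}`; treat it as an illustration rather than a way to parse HTML.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .{1,} grabs as much as possible, so one match spans the whole string.
greedy = re.findall(r"<.{1,}>", html)

# Lazy: .{1,}? stops at the first '>' that allows a match, giving one match per tag.
lazy = re.findall(r"<.{1,}?>", html)

print(greedy)
print(lazy)
```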

#### Shorthand

Shorthand characters are also used to designate repetitions.

|Shorthand| Situation|
|---|---|
|`+` |`{1,}`  one or more|
|`*` |`{0,}`  zero or more|

In [46]:
re.findall(r'[\d]{1,}', '1 42 123 6655 / Office Number')

['1', '42', '123', '6655']

In [47]:
re.findall(r'[\d]+', '1 42 123 6655 / Office Number')

['1', '42', '123', '6655']
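
The cells above only exercise `+`. As a small sketch of how `*` differs, the example below runs both shorthands on a made-up string.

```python
import re

text = "a ab abbb b"

# 'b+' needs at least one 'b' after the 'a'.
one_or_more = re.findall(r"ab+", text)

# 'b*' also accepts an 'a' with no 'b' after it.
zero_or_more = re.findall(r"ab*", text)

print(one_or_more)
print(zero_or_more)
```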

## Grouping

Grouping with parentheses `()` allows us to select individual groups within the match.

For example:

In [48]:
match = re.search(r'([\d]{4})-([\d]{4})', '6655-2211 / Office Number')

In [49]:
match.group() #return entire match

'6655-2211'

In [50]:
match.groups() #return a tuple corresponding to each individual group

('6655', '2211')

In [51]:
match.group(2) # groups are 1-indexed; group(0) is the entire match

'2211'
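
Groups also change what `findall()` returns: when the pattern contains more than one group, each result is a tuple of the group texts instead of the whole match. A brief sketch with made-up phone numbers:

```python
import re

text = "Office: 6655-2211, Home: 6123-4567"
pattern = r"(\d{4})-(\d{4})"

# With two groups in the pattern, findall() returns a list of tuples.
pairs = re.findall(pattern, text)

# On a Match object, span(n) gives the position of group n.
first = re.search(pattern, text)
second_group_span = first.span(2)

print(pairs)
print(second_group_span)
```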

## Exercise

Create one regular expression to extract the following number formats

(65) 3867-5309

(+65) 38675309

+65 3867-5309

65 3867-5309

65-3867-5309

6538675309

65.3867.5309

65 3867 5309

3867 5309

`list_num = ['(65) 3867-5309','(+65) 38675309','+65 3867-5309','65 3867-5309','65-3867-5309','6538675309','65.3867.5309','65 3867 5309','3867 5309']`

To make it easier to test, we will create a compiled regular expression object that can be passed around easily.



```python
regex = re.compile(r'[\d]')
regex.search('abc 123')
```
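
A compiled object exposes the same operations as the module-level functions, so it can be reused across many strings. The sketch below uses a deliberately simple placeholder pattern and a made-up list; neither is meant as a solution to the exercise.

```python
import re

# Placeholder pattern: one or more digits.
digits = re.compile(r"\d+")

samples = ["abc 123", "no digits", "42 and 7"]

# The compiled object has search(), findall(), and so on.
has_digits = [digits.search(s) is not None for s in samples]
all_numbers = [digits.findall(s) for s in samples]

print(has_digits)
print(all_numbers)
```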




In [52]:
#todo: Exercise


In [53]:
#todo: Exercise


## Named Groups

Besides the positionally numbered groups, names can be assigned to groups by placing `?P<group_name>` immediately after the opening `(` character.

In [54]:
match = re.search(r'(?P<part_1>[\d]{4})-(?P<part_2>[\d]{4})','4533-1122')

In [55]:
match.group('part_1')

'4533'

In [56]:
match.group('part_2')

'1122'

`groupdict()` returns a dictionary instead of a tuple, and the dictionary keys correspond to the names of the groups.

If there is a mix of named groups and unnamed groups, the unnamed groups are not part of the dictionary returned by `groupdict()`

In [57]:
match.groupdict()

{'part_1': '4533', 'part_2': '1122'}
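
The case of mixed named and unnamed groups can be sketched as follows. The group names `area` and `last` and the sample number are illustrative only.

```python
import re

# The middle group is unnamed; the outer two are named.
match = re.search(r"(?P<area>\d{2})-(\d{4})-(?P<last>\d{4})", "65-3867-5309")

all_groups = match.groups()       # every group, named or not, in order
named_only = match.groupdict()    # only the named groups
middle = match.group(2)           # named groups are still numbered as well

print(all_groups)
print(named_only)
print(middle)
```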

## Backreferences

A backreference refers to a previously matched group within a regular expression, requiring that the same matched text appear again.

Refer to numbered groups using `\N`, where N is the group number. For example, `\1` matches the text of the first group and `\2` that of the second (up to 99 groups).

In [58]:
match = re.search(r'<([\w_-]+)>body</\1>','<p>body</p>')
match

<re.Match object; span=(0, 11), match='<p>body</p>'>

In [59]:
match = re.search(r'<([\w_-]+)>body</\1>','<p>body</table>') #no match
match
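
Two further uses of backreferences are sketched below: finding doubled words, and referring to a named group with `(?P=name)`, which is the named counterpart of `\N` in the `re` module. The sample sentence and the group name `tag` are made up.

```python
import re

# \1 must repeat whatever text group 1 captured, so this finds doubled words.
text = "This is is a test of the the parser"
repeated_words = re.findall(r"\b(\w+) \1\b", text)

# A named group can be referenced with (?P=name) instead of a number.
tag_pattern = r"<(?P<tag>\w+)>body</(?P=tag)>"
same_tag = re.search(tag_pattern, "<p>body</p>")
different_tag = re.search(tag_pattern, "<p>body</table>")

print(repeated_words)
print(same_tag)
print(different_tag)
```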

## Lookahead

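A lookahead accepts or rejects a match based on the presence or absence of content after it, without making the subsequent content part of the match. A negative lookahead is written `(?!...)` and succeeds only when the content inside does not follow.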

In [60]:
re.search(r'e(?!c)','peace')

<re.Match object; span=(1, 2), match='e'>

In [61]:
re.findall(r'e(?!c)','peace') # neither e is followed by 'c', so both match

['e', 'e']

Replace '!' with '=' for a positive lookahead.

In [62]:
re.search(r'e(?=a)','peace')

<re.Match object; span=(1, 2), match='e'>

In [63]:
re.findall(r'e(?=a)','peace') # the last e is not followed by 'a', so only the first e matches

['e']
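
A more practical sketch of a positive lookahead is extracting numbers that are followed by a unit without capturing the unit itself. The text and the unit `kg` are made up for illustration.

```python
import re

text = "5kg 10lb 20kg"

# The lookahead checks that 'kg' follows but leaves it out of the match.
kg_values = re.findall(r"\d+(?=kg)", text)

# Without a lookahead, 'kg' is consumed and becomes part of the match.
kg_with_unit = re.findall(r"\d+kg", text)

print(kg_values)
print(kg_with_unit)
```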

## Case Insensitivity Mode

The `re.IGNORECASE` flag makes the pattern match letters regardless of case.

In [64]:
re.search(r'Happy', 'HAPPY birthday', re.IGNORECASE)

<re.Match object; span=(0, 5), match='HAPPY'>

## Dot Matching Newline Mode

The re.DOTALL flag causes the . character to match newline characters in addition to all other characters.

In [65]:
re.findall(r'.+', 'line1\nline2')

['line1', 'line2']

In [66]:
re.findall(r'.+', 'line1\nline2', re.DOTALL)

['line1\nline2']

## Multiline Mode

By default, the `^` and `$` characters match the beginning and end of the whole string respectively. With `re.MULTILINE`, they also match at the beginning and end of each line within the string.

In [67]:
re.search(r'^line2','line1\nline2')

In [68]:
re.search(r'^line2','line1\nline2', re.MULTILINE)

<re.Match object; span=(6, 11), match='line2'>

## Combining Modes

When you need to set up multiple modes, join the flags using bitwise OR (|).

For example: `re.DOTALL | re.MULTILINE`
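
As a minimal sketch of combined flags in use, the example below joins `re.IGNORECASE` and `re.MULTILINE` on a made-up three-line string and compares the result with using `re.MULTILINE` alone.

```python
import re

text = "Line1\nLINE2\nline3"

# Both flags: ^ matches at every line start, and case is ignored.
both_flags = re.findall(r"^line\d", text, re.IGNORECASE | re.MULTILINE)

# Only MULTILINE: ^ matches at every line start, but case still matters.
multiline_only = re.findall(r"^line\d", text, re.MULTILINE)

print(both_flags)
print(multiline_only)
```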

## Substitution

String replacement can also be done using regular expressions. `re.sub` takes three required arguments: the regular expression, the replacement string, and the source string being searched.

`re.sub` enables us to use backreferences from regular expression patterns within the replacement string.

In [69]:
re.sub(r'([\d]{4})-([\d]{4})', r'\1\2', '6655-2211') # remove the hyphen

'66552211'
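
A few more ways of using `re.sub` are sketched below: reordering groups, referring to named groups with `\g<name>`, and passing a function as the replacement. The sample strings, group names and the helper `double` are hypothetical.

```python
import re

# Reorder numbered groups: year-month-day becomes day/month/year.
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", "2024-01-31")

# Named groups are referenced in the replacement with \g<name>.
reordered = re.sub(
    r"(?P<first>\d{4})-(?P<second>\d{4})",
    r"\g<second>-\g<first>",
    "6655-2211",
)

# The replacement can also be a function that receives each Match object.
def double(match):
    return str(int(match.group()) * 2)

doubled = re.sub(r"\d+", double, "3 apples and 10 pears")

print(swapped)
print(reordered)
print(doubled)
```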

## Final Notes:

If you are able to break down regular expressions, they will become more manageable and less intimidating.

Regular expressions are useful for finding and validating data, but be wary of overusing them.

Straightforward direct string comparisons are usually preferred, as the complexity introduced by a regular expression may not be worthwhile. Data with a complicated structure is also often unsuitable for regular expressions.
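
For comparison, here is a small sketch of fixed-text tasks done with plain string operations next to their regular expression equivalents. The sample text is made up; both approaches give the same results here, and the string methods are easier to read.

```python
import re

text = "Today is a great day"

# Plain string operations
contains_plain = "great" in text
starts_plain = text.startswith("Today")
replaced_plain = text.replace("great", "good")

# Regular expression equivalents
contains_regex = re.search(r"great", text) is not None
starts_regex = re.match(r"Today", text) is not None
replaced_regex = re.sub(r"great", "good", text)

print(contains_plain, starts_plain, replaced_plain)
print(contains_regex, starts_regex, replaced_regex)
```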


### Exercise
1. Write a regular expression to match a sequence of lowercase letters, separated by an underscore. For example: 'abc_efg', 'a_c'
2. Write a regular expression to replace the underscore separator of a sequence of lowercase letters (such as 'abc_efg', 'a_c') with %.

In [70]:
# todo: Exercise


In [71]:
# todo: Exercise
