# What is a regular expression?

A regular expression is a special sequence of characters that describes a pattern, used to match the strings we want to find.

Regular expressions let us do fast text retrieval and text replacement.

Some scenarios where regular expressions are useful:

- Checking whether a series of digits is a telephone number.

- Checking whether a string is an email address.

- Replacing one word with another in a text.

- ...

In [6]:
# An example of a simple regular expression.

# define a string
a = 'C|C++|Java|C#|Python|Javascript'

# To check if 'Python' is in this string,
# we can use the string method 'find()' or the membership operator 'in'.
print('Python' in a)
print(a.find('Python') > -1)  # find() returns -1 when there is no match; index() would raise ValueError instead.

# To use regular expression
import re  # import regular expression module. syntax: re.findall('pattern_to_search', str_name).
r = re.findall('Python', a)  # It returns a list, as findall() method may have multiple matches.
if len(r) > 0:
    print('Found string "Python"')
else:
    print('Not found')

# This example is not very useful, as we searched for the constant 'Python', i.e. the exact word.
# We didn't use any abstract pattern to do the matching.

True
True
Found string "Python"


In [7]:
# A more meaningful regular expression example
# To extract all the numbers in this string.
# You could use a 'for' loop to do this, but here we see how to achieve it with a regular expression.

import re

a = 'C0C++8Java5C#7Python2Javascript'

r = re.findall(r'\d', a)  # In a regular expression, '\d' represents any digit 0-9. The r'' prefix keeps Python from interpreting the backslash itself.
print(r)


['0', '8', '5', '7', '2']


A regular expression is essentially a string. It consists of ordinary characters and metacharacters.

- In the first example, every character in `Python` is an ordinary character.

- In the second example, `\d` is a metacharacter.

- A regular expression is a combination of ordinary characters and metacharacters.

- There are too many metacharacters to memorize them all. A better approach is to learn how regular expressions work, think about what needs to be matched, and look up the details in a metacharacter table.
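
As a small sketch of how ordinary characters and metacharacters combine in one pattern (the sample string `text` below is made up for illustration and does not come from the examples above):

```python
import re

# Hypothetical sample text, used only for this illustration.
text = 'Java8 Java11 JavaScript Python3'

# 'Java' consists of ordinary characters and matches itself literally;
# '\d' is a metacharacter that matches one digit.
r = re.findall(r'Java\d', text)
print(r)  # ['Java8', 'Java1']
```

`JavaScript` is skipped because the character after `Java` is not a digit, and `Java11` only contributes `Java1` because `\d` matches a single character.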

# Character set
## Normal character set

In [11]:
import re

# Define a string that consists of multiple words, separated by commas.
s = 'abc, acc, adc, aec, grafc, ahc'

# Now match the words whose middle character is 'c' or 'f'.

r = re.findall('a[cf]c', s)
# Use [] as a character set and put all the characters you want to match inside it.
# The ordinary characters 'a' and 'c' outside the [] serve as boundaries:
# the matched character must have 'a' on its left
# and 'c' on its right.
# Note that the whole word doesn't need to be exactly 'acc' or 'afc'.
# In this example, 'afc' is found inside the word 'grafc'.
# Only 'afc' is returned, not the whole word, because the pattern describes exactly 3 characters.

print(r)

['acc', 'afc']


In [13]:
# To match characters that are NOT in the character set, put ^ at the start of the set.
r = re.findall('a[^chd]c', s)

print(r)

['abc', 'aec', 'afc']


In [14]:
# If the characters in the set are consecutive, you can use - to connect the first and the last instead of typing a long list of characters.
r = re.findall('a[c-f]c', s)

print(r)

['acc', 'adc', 'aec', 'afc']


- The characters in a character set `[]` are alternatives: the set matches any one of them (an `or` relationship).
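
Because everything inside `[]` is an alternative, several ranges can sit side by side in one set. A minimal sketch, reusing the string `s` from the cells above:

```python
import re

s = 'abc, acc, adc, aec, grafc, ahc'

# Two ranges in one set: the middle character may be b-d or f-h.
r = re.findall('a[b-df-h]c', s)
print(r)  # ['abc', 'acc', 'adc', 'afc', 'ahc']
```

Only `aec` is left out, since `e` falls in neither range.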

## Abstract character set

- **This is not an official name.**

- It is a short combination of characters that represents a whole group of characters.

- E.g. `\d` represents the digits 0-9, the same as `[0-9]`; `\D` represents non-digit characters, the same as `[^0-9]`.

- A character set matches only a single character.
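
The equivalences above can be checked directly. This is a sketch for an ASCII-only string; note that for Python `str` patterns `\d` also matches other Unicode decimal digits unless the `re.ASCII` flag is used, so the two forms are not identical for every possible input.

```python
import re

a = 'C0C++8Java5C#7Python2Javascript'

# For this ASCII-only string, each shorthand gives the same result
# as the explicit character set it stands for.
same_digits = re.findall(r'\d', a) == re.findall('[0-9]', a)
same_non_digits = re.findall(r'\D', a) == re.findall('[^0-9]', a)

print(same_digits)      # True
print(same_non_digits)  # True
```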

In [22]:
# \w for matching [A-Za-z0-9_], word characters

a = 'C0C++\n8_Java_5  C#7 Python\n2&*%$Javas\tcript\r'

r = re.findall(r'\w', a)

# Like \D, \W matches [^A-Za-z0-9_], i.e. non-word characters.
r1 = re.findall(r'\W', a)

print(r)
print(r1)  # The output shows that the space (' '), newline (\n), tab (\t) and carriage return (\r) are non-word characters.


['C', '0', 'C', '8', '_', 'J', 'a', 'v', 'a', '_', '5', 'C', '7', 'P', 'y', 't', 'h', 'o', 'n', '2', 'J', 'a', 'v', 'a', 's', 'c', 'r', 'i', 'p', 't']
['+', '+', '\n', ' ', ' ', '#', ' ', '\n', '&', '*', '%', '$', '\t', '\r']


In [23]:
# \s matches whitespace characters

a = 'C0C++\n8_Java_5  C#7 Python\n2&*%$Javas\tcript\r'

r = re.findall(r'\s', a)

print(r)  # Note that characters like '&' and '$' are not matched. They are not whitespace characters.


['\n', ' ', ' ', ' ', '\n', '\t', '\r']


# Quantifier

## Basic usage

A quantifier makes a character set (or a single character) match several times in a row.

In [24]:

a = 'C0C++8Java5C#7Python2Javascript'

# If we want to match 3 letters, use {} to specify the number after the character set.
r = re.findall('[a-z]{3}', a)
print(r)

['ava', 'yth', 'ava', 'scr', 'ipt']


In [26]:
# To match each whole word

a = 'C0C++8Java5C#7Python2Javascript'

# A quantifier can also be a range; use ',' to separate the two ends.
r = re.findall('[A-Za-z]{3,6}', a)  # Match runs of 3 to 6 letters.
print(r)


['Java', 'Python', 'Javasc', 'ript']


**Question: in the example above, we want to match runs of 3 to 6 letters. Once Python has matched `'Jav'`, that is already a successful match: it has 3 letters and fits the pattern we defined. Why didn't Python stop there instead of continuing to match the whole word `'Java'`?**

## Greedy and non-greedy

- A quantifier in a regular expression has two matching modes: greedy and non-greedy.

- Greedy mode is the default. It matches as many repetitions as possible within the range of the quantifier.

- Non-greedy mode is selected by adding a `?` directly after the quantifier. It matches as few repetitions as possible.

In [27]:
# Non-greedy match
a = 'C0C++8Java5C#7Python2Javascript'

r = re.findall('[A-Za-z]{3,6}?', a)
print(r)

['Jav', 'Pyt', 'hon', 'Jav', 'asc', 'rip']
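
The same contrast on a shorter input may make the two modes easier to compare. This is a sketch with a made-up digit string, not part of the original examples:

```python
import re

# Hypothetical input: six digits in a row.
digits = '123456'

# Greedy: each match takes up to 4 digits if they are available.
greedy = re.findall(r'\d{2,4}', digits)
print(greedy)  # ['1234', '56']

# Non-greedy: each match stops as soon as the minimum of 2 digits is reached.
lazy = re.findall(r'\d{2,4}?', digits)
print(lazy)  # ['12', '34', '56']
```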


## Match 0, 1 or infinite times

**Other quantifiers**  

- `*` matches the character in front of `*` 0 or more times.

- `+` matches the character in front of `+` 1 or more times.

- `?` matches the character in front of `?` 0 or 1 time.

In [32]:
# '*' to match 0 or more times.

b = 'pytho0python1pythonn2'  # Note the three different spellings: 'pytho', 'python', 'pythonn'

# 'python*' means: match 'pytho', followed by 'n' 0 or more times.
r = re.findall('python*', b)
print(r)

['pytho', 'python', 'pythonn']


In [33]:
# '+' to match 1 or more times.

b = 'pytho0python1pythonn2'  # Note the three different spellings: 'pytho', 'python', 'pythonn'

# 'python+' means: match 'pytho', followed by 'n' 1 or more times.
r = re.findall('python+', b)
print(r)

['python', 'pythonn']


In [34]:
# '?' to match 0 or 1 time.

b = 'pytho0python1pythonn2'  # Note the three different spellings: 'pytho', 'python', 'pythonn'

# 'python?' means: match 'pytho', followed by 'n' 0 or 1 time.
r = re.findall('python?', b)
print(r)

# Why is the last word matched, and why as 'python'?
# Because 'pythonn' contains 'python', which satisfies the pattern 'python?'.
# The match ends after the first 'n', so the second 'n' is left out of the result.

['pytho', 'python', 'python']
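
The three shorthand quantifiers correspond to `{}` ranges: `*` is `{0,}`, `+` is `{1,}` and `?` is `{0,1}`. A quick sketch that checks this on the string `b` used above:

```python
import re

b = 'pytho0python1pythonn2'

# Each shorthand quantifier and its {} form should give identical results.
star = re.findall('python*', b) == re.findall('python{0,}', b)
plus = re.findall('python+', b) == re.findall('python{1,}', b)
mark = re.findall('python?', b) == re.findall('python{0,1}', b)

print(star, plus, mark)  # True True True
```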


## The use of `?` 

- In non-greedy matching, a `?` placed after a quantifier switches that quantifier to non-greedy mode. Here `?` is a modifier.

- On its own, `?` means matching the preceding character 0 or 1 time. Here `?` is a quantifier.

In [38]:
# A small modification of the previous example.

b = 'pytho0python1pythonn2'

# Here {1,2} matches 'n' 1 or 2 times in greedy mode, so it prints both 'python' and 'pythonn'.
r = re.findall('python{1,2}', b)
print(r)

# Here '?' switches the quantifier to non-greedy mode, so only one 'n' is matched each time.
r1 = re.findall('python{1,2}?', b)
print(r1)

['python', 'pythonn']
['python', 'python']


# Boundary matching

In [40]:
import re

# Define a variable holding an ID. A valid ID is a series of 4 to 8 digits.
f_id = '10000000001'

# '\d{4,8}' alone is not enough: when the string has more than 8 digits, it still matches the first 8 of them.
# To validate the whole string, anchor the pattern to both boundaries with '^' and '$'.
r = re.findall(r'^\d{4,8}$', f_id)
print(r)


[]


- The boundary symbols `^` and `$` anchor a pattern: `^` matches only at the start of the string and `$` only at its end. Used together, they force the pattern to cover the whole string.

- `re.findall(r'^\d{4,8}$', f_id)` therefore succeeds only if the entire string `f_id` consists of 4 to 8 digits. Here `f_id` has 11 digits, so nothing matches and the result is the empty list `[]`.
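
To see the anchored pattern accept and reject input, here is a minimal sketch. The candidate IDs are invented for illustration. `re.fullmatch()` is shown as an alternative that requires the whole string to match without writing the anchors.

```python
import re

pattern = r'^\d{4,8}$'

# Hypothetical IDs, used only to illustrate the check.
candidates = ['123', '1234', '12345678', '123456789', '1234abcd']

# A non-empty findall() result means the whole string is 4 to 8 digits.
results = {c: bool(re.findall(pattern, c)) for c in candidates}
for c, ok in results.items():
    print(c, ok)

# re.fullmatch() only succeeds if the pattern covers the entire string.
too_long = re.fullmatch(r'\d{4,8}', '10000000001')
print(too_long)  # None
```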

In [44]:
f_id = '100000000001'

r = re.findall('000', f_id)
print(r)

r1 = re.findall('^000', f_id)
print(r1)

r2 = re.findall('000$', f_id)
print(r2)

['000', '000', '000']
[]
[]


- `re.findall('^000', f_id)` looks for `'000'` only at the very start of `f_id`. The string starts with `'100'`, so the result is `[]`.

- `re.findall('000$', f_id)` looks for `'000'` only at the very end of `f_id`. The string ends with `'001'`, so the result is `[]`.

# Group

**Use `()` to define a group. A quantifier after a group applies to the whole group of characters.**

In [54]:
import re

a = 'pythonpythonpythonpython'

# r = re.findall('python{3}', a)
# This only repeats 'n' 3 times.

r = re.findall('(python){3}', a)
# By adding () around 'python', the pattern matches 'python' 3 times in a row.

print(r)

# Why is the output not ['pythonpythonpython']?


['python']


**Why is the output not `['pythonpythonpython']`?**

The Python documentation for `re.findall()` says:
If one or more groups are present in the pattern, **return a list of groups**; this will be **a list of tuples if the pattern has more than one group**. Empty matches are included in the result.

So `findall()` returns what the group `(python)` captured, not the whole match. A repeated group keeps only its last repetition, which is `'python'`.
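
If the whole repeated text is what you want back, two options are sketched here. Both use standard `re` features: the non-capturing group `(?:...)`, and `group(0)` on a match object returned by `re.search()`.

```python
import re

a = 'pythonpythonpythonpython'

# (?:...) groups without capturing, so findall() returns the whole match.
whole = re.findall('(?:python){3}', a)
print(whole)  # ['pythonpythonpython']

# Alternatively, keep the capturing group and read the whole match via group(0).
m = re.search('(python){3}', a)
print(m.group(0))  # pythonpythonpython
print(m.group(1))  # python  (the last repetition captured by the group)
```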
  
  


**Group and set**  
  
  
- A group uses `()`. The characters in `()` have an `and` relationship: they must all match, in order.

- A set uses `[]`. The characters in `[]` have an `or` relationship: any one of them matches.

# Flags

- `flags` defines matching modes. It is an argument accepted by many functions in the `re` module.

- The expression’s behaviour can be modified by specifying a `flags` value. Values can be any of the following variables, combined using bitwise OR (the `|` operator).

- `flags` has a default value of `0`.

Available flags:
- `re.A` or `re.ASCII`  
  

- `re.DEBUG`  
  
  

- `re.I` or `re.IGNORECASE`  
  
  

- `re.L` or `re.LOCALE`  
  
  

- `re.M` or `re.MULTILINE`  
  
  

- `re.S` or `re.DOTALL`  
  
  

- `re.X` or `re.VERBOSE`

In [68]:
# An example of passing arguments to 'flags'.

import re

language = 'PythonC#Java\nPHP'

# 're.I' makes the match case-insensitive.
r = re.findall('java', language, re.I)  # Or 'flags=re.I'.
print(r)

# '.' matches any character except '\n'.
# So 'java.{1}' means: match 'java' followed by exactly one character other than '\n'.
r1 = re.findall('java.{1}', language, re.I)
print(r1)  # '.' can't match the '\n' after 'Java', so it prints an empty list.

# Pass multiple flags. 're.S' makes '.' match any character, including the newline '\n'.
r = re.findall('java.{1}', language, re.I | re.S)  # Or 'flags=re.I | re.S'. '|' is the bitwise OR operator; it applies both flags.
print(r)


['Java']
[]
['Java\n']
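
As one more illustration from the flag list, `re.M` (`re.MULTILINE`) changes where `^` may match. The sketch below uses a made-up multi-line string and passes the flag by keyword:

```python
import re

# Hypothetical multi-line sample, one language per line.
text = 'Python\nJava\nPHP'

# Without re.M, '^' matches only at the very start of the string.
first_only = re.findall(r'^\w+', text)
print(first_only)  # ['Python']

# With re.M, '^' also matches right after every newline.
every_line = re.findall(r'^\w+', text, flags=re.M)
print(every_line)  # ['Python', 'Java', 'PHP']
```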


# `re.sub()` method

## Basic function

**To substitute the strings that are matched**

**Syntax: `re.sub(pattern, repl, string, count=0, flags=0)`**

`repl` is the replacement used for each match.

`count` is the maximum number of replacements to make. The default value `0` means every match in the string is replaced.


In [72]:
import re

language = 'PythonC#Java\nC#PHPC#'

# Replace 'C#' with 'GO'
r = re.sub('C#', 'GO', language)  # It returns the whole string with the replacements made.
print(r)

# Now we pass the value 1 to 'count'
r1 = re.sub('C#', 'GO', language, count=1)  # This replaces only the first match.
print(r1)

# For simple replacements, we can also use the built-in string method 'replace()'.


PythonGOJava
GOPHPGO
PythonGOJava
C#PHPC#
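
For a fixed search text like `'C#'`, the string method `str.replace()` gives the same results as the two `re.sub()` calls. A short sketch to confirm this:

```python
import re

language = 'PythonC#Java\nC#PHPC#'

# str.replace(old, new) replaces every occurrence;
# an optional third argument limits the number of replacements.
all_replaced = language.replace('C#', 'GO')
first_only = language.replace('C#', 'GO', 1)

print(all_replaced == re.sub('C#', 'GO', language))         # True
print(first_only == re.sub('C#', 'GO', language, count=1))  # True
```

`re.sub()` becomes the better tool once the text to replace is a pattern rather than a constant.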


## Pass function to `re.sub()` as argument
**Another powerful feature of `re.sub()` is that the `repl` argument can be a function.**

In [75]:
# Now we define a function and see what happens.

def convert(value):
    pass

r = re.sub('C#', convert, language)
print(r)

# Why are all the 'C#' gone from the output?

PythonJava
PHP


- When we pass a function as the `repl` argument, as we do here with `convert`, the function is called once for every match, and the match is passed to the function's parameter.

- The return value (**Note: THE RETURN VALUE**) of the function is used as the replacement.

- E.g. when `C#` is matched, that match is passed to the parameter `value` of the function `convert`.

- The return value of `convert`, which is `None` here because the body is just `pass`, is used as the replacement. In the output above, every `C#` has therefore been replaced by nothing.

In [78]:
# Now let the function do some operations on the parameter

# First, let's see what the parameter 'value' actually contains.
def convert(value):
    print(value)

r = re.sub('C#', convert, language)
print(r)

<re.Match object; span=(6, 8), match='C#'>
<re.Match object; span=(13, 15), match='C#'>
<re.Match object; span=(18, 20), match='C#'>
PythonJava
PHP


- We can see that the parameter `value` of the function `convert` is not simply the string `C#`. It is a `re.Match` object, which holds not only the matched string but also its position in the original string.

- For example, the first value `<re.Match object; span=(6, 8), match='C#'>` means the matched characters `C#` start at index 6 (there are 6 characters in front of them) and end just before index 8.

- We can also see three values printed for `value`. This makes sense: there are three matches of `C#`, so the function `convert` is called three times.

- To get the matched characters from a `re.Match` object, call its `group()` method. Here, that is `value.group()`.

In [79]:
# Now we get the matched characters inside the function.

def convert(value):
    matched = value.group()
    return '!!' + matched + '!!'

r = re.sub('C#', convert, language)
print(r)

Python!!C#!!Java
!!C#!!PHP!!C#!!


- Now we can see that every `C#` in the original string has been dynamically replaced by `!!C#!!`.

- This is very useful, because most of the time we are not just replacing a constant. We often need to do some operation on the matches, and sometimes we need different replacements depending on what was matched. Such cases would be very difficult to handle without functions.

## An example of passing function to `re.sub()` as argument

In [81]:
# Aim: find the digits in string 's'; replace each digit >= 6 with 9
# and each digit < 6 with 0.

import re
s = 'A8C3721D86'

def convert(value):
    matched = value.group()
    if int(matched) >= 6:  # value.group() returns a string, so convert it to an integer for the comparison.
        return '9'         # The replacement must be a string; re.sub() can't use a number.
    else:
        return '0'

r = re.sub(r'\d', convert, s)
print(r)

A9C0900D99


- Letting a function receive another function as one of its arguments is a classic design idea in programming.

- The designer of a program usually can't decide in advance what logic each user needs. So the designer leaves a hook (like `convert` in the example) and lets users implement their own logic.

- The designer requires the user to pass a function as an argument, calls that function inside the program, and hands it an intermediate value.

- The designer doesn't care how the user manipulates the intermediate value. The only thing the program needs is the return value of the user-defined function.

- In other words: to use my program, define your own function and pass it in as an argument. I will give you an intermediate value to implement your logic with, and you give me back a return value so that my program can continue.
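
The design idea described above can be sketched with a toy function. `replace_digits` is a hypothetical helper written for this illustration; it is not how `re.sub()` is implemented, it only plays the role of "the designer's program" that calls a user-supplied function.

```python
def replace_digits(text, handler):
    """Toy stand-in for re.sub(): calls handler for every digit in text."""
    result = []
    for ch in text:
        if ch.isdigit():
            # Hand the intermediate value to the caller's function
            # and use whatever it returns.
            result.append(handler(ch))
        else:
            result.append(ch)
    return ''.join(result)


def convert(digit):
    # The user's own logic: digits >= 6 become '9', the rest become '0'.
    return '9' if int(digit) >= 6 else '0'


output = replace_digits('A8C3721D86', convert)
print(output)  # A9C0900D99
```

Swapping in a different `handler` changes the behaviour without touching `replace_digits` itself.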



# `re.search()` and `re.match()` methods

`match()` syntax: `re.match(pattern, string, flags=0)`  
If zero or more characters **at the beginning of string** match the regular expression pattern, return a corresponding **match object**. Return `None` if the string does not match the pattern; note that this is different from a zero-length match. 
  
  
`search()` syntax: `re.search(pattern, string, flags=0)`  
**Scan through string** looking for the **first location** where the regular expression pattern produces a match, and return a corresponding **match object**. Return `None` if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string. 
    
    
We can see that these two functions take the same parameters, which are also the same as those of `re.findall()`.

In [88]:
import re

s = '1A83C72D1D8E67'

r = re.match(r'\d', s)
print(r)
print(r.span())

r1 = re.search(r'\d', s)
print(r1)
print(r1.group())

<re.Match object; span=(0, 1), match='1'>
(0, 1)
<re.Match object; span=(0, 1), match='1'>
1


- `search()` and `match()` return a match object, not a plain string.

- To access the matched string, use the `group()` method.

- The match position is available through the `span()` method.

- Both functions match only once: they stop as soon as they find the first match.
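
In the cell above both functions return the same result because the string starts with a digit. The difference shows up when it doesn't. A minimal sketch with a modified, made-up string:

```python
import re

# Hypothetical string that starts with a letter instead of a digit.
s = 'A83C72D1D8E67'

# match() only tries at the beginning of the string, so it finds nothing.
m = re.match(r'\d', s)
print(m)  # None

# search() scans forward and stops at the first digit it finds.
found = re.search(r'\d', s)
if found is not None:  # always check for None before calling group()
    print(found.group(), found.span())  # 8 (1, 2)
```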

# `group()`, `groups()` and Grouping
## Use `group()`

In [94]:
# Aim: to extract the characters between 'life' and 'python' in the following string.

import re

s = 'life is short, i use python'

r = re.search(r'life\wpython', s)
#print(r.group())
# This raises an error: '\w' stands for only 1 character, so nothing matches between the boundaries and search() returns None.

r1 = re.search(r'life\w*python', s)  # So use '\w*'
#print(r1.group())
# Still an error, because '\w' doesn't match the space character ' ', so the pattern can't match.

r2 = re.search('life.*python', s)
print(r2.group())
# But this returns the whole string, not what the aim asks for.

r3 = re.search('life(.*)python', s)
print(r3.group(1))
# We talked about grouping before: () defines a group.
# The whole expression 'life.*python' always counts as group 0, even without any ().
# group() with no argument is the same as group(0): it returns the match of the whole expression,
# which is why 'life' and 'python' were also returned for r2.
# To leave the boundaries out, put only the wanted part in a group: 'life(.*)python'.
# Now (.*) is group 1, and passing the group number to group() returns what that group matched.
# So print(r3.group(1)) gives the expected result.

life is short, i use python
 is short, i use 


In [95]:
# An advantage of 're.findall()' for this aim is that it returns the group's content directly, without calling group().

r4 = re.findall('life(.*)python', s)
print(r4)

[' is short, i use ']


## Multiple groups

In [98]:
# Aim: to extract the characters between 'life' and 'python' 
# as well as the characters between the two 'python' in the following string.

import re

s = 'life is short, i use python, i love python'

r = re.search('life(.*)python(.*)python', s)
print(r.group(0))
print(r.group(1))
print(r.group(2))

print(r.group(0,1,2))

print(r.groups())

life is short, i use python, i love python
 is short, i use 
, i love 
('life is short, i use python, i love python', ' is short, i use ', ', i love ')
(' is short, i use ', ', i love ')


- With two `()` in the expression, we have two groups to match.

- We can pass multiple group numbers to the `group()` method. It then returns the matches of those groups in a tuple.

- Accessing the groups one at a time returns regular strings.

- `groups()` returns only the matches of the groups, also in a tuple. It doesn't include the whole match, which is what `group(0)` returns.
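
The documentation for `re.findall()` quoted earlier says the result is a list of tuples when the pattern has more than one group. A short sketch with the same two-group pattern shows how that compares to `groups()`:

```python
import re

s = 'life is short, i use python, i love python'

# With two groups, findall() returns one tuple per match.
r = re.findall('life(.*)python(.*)python', s)
print(r)  # [(' is short, i use ', ', i love ')]

# The single tuple holds the same values that groups() returns for search().
same = r[0] == re.search('life(.*)python(.*)python', s).groups()
print(same)  # True
```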

**There are many regular expressions that other people have written for different scenarios, and you can use them directly. Use them, but always test them first and always learn from them.**