- Title: Regular Expression in Python
- Slug: regular-expression-python
- Date: 2021-04-18 13:42:45
- Category: Computer Science
- Tags: programming, Python, regex, regular expression
- Author: Ben Du
- Modified: 2021-03-18 13:42:45



[Online Regular Expression Tester](https://regex101.com/)


1. The Python module `re` automatically compiles a pattern passed as a string
  using `re.compile` and caches the compiled object,
  so there is little benefit in compiling such patterns yourself.

1. The regular expression modifier `(?i)` turns on case-insensitive matching.

2. `re.match` looks for a match only at the beginning of the string
    while `re.search` looks for a match anywhere in the string.
    Since the regular expression symbol `^` stands for the beginning of a string,
    you can prefix your regular expression with `^` 
    to make `re.search` look for a match only at the beginning of the string.
    To sum up, 
    `re.search` is more flexible than `re.match` 
    and it is suggested that you always use `re.search` instead of `re.match`.

3. Passing `re.DOTALL` to the argument `flags` makes the dot (`.`) match any character
    including a newline (by default the dot does not match a newline).

1. `re.search` searches for the first match anywhere in the string.

2. `re.match` searches for a match only at the beginning of the string.

3. `re.findall` finds all matches in the string and returns them as a list.

4. `re.finditer` finds all matches and returns an iterator of `Match` objects.
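The two flags mentioned in the list above can be illustrated with a small sketch. The sample strings and variable names are made up for illustration; `re.IGNORECASE` is the flag counterpart of the inline modifier `(?i)`.

```python
import re

# (?i) at the start of the pattern turns on case-insensitive matching.
m_inline = re.search(r"(?i)python", "I like PYTHON.")
print(m_inline.group(0))  # PYTHON

# Passing re.IGNORECASE via flags has the same effect as (?i).
m_flag = re.search(r"python", "I like PYTHON.", flags=re.IGNORECASE)
print(m_flag.group(0))  # PYTHON

text = "first line\nsecond line"

# By default the dot does not match a newline,
# so the match stops at the end of the first line.
first_only = re.search(r"first.*", text).group(0)
print(first_only)  # first line

# With re.DOTALL the dot also matches the newline.
whole = re.search(r"first.*", text, flags=re.DOTALL).group(0)
print(whole == text)  # True
```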


In [1]:
import re

## re.compile

The compiled object is of type `re.Pattern` 
and has methods `search`, `match`, `sub`, `findall`, `finditer`, etc.

In [4]:
# Use a raw string so that backslashes reach the regex engine unchanged.
p = re.compile(r"\d{4}-\d{2}-\d{2}$")

In [5]:
type(p)

re.Pattern

In [8]:
[mem for mem in dir(p) if not mem.startswith("_")]

['findall',
 'finditer',
 'flags',
 'fullmatch',
 'groupindex',
 'groups',
 'match',
 'pattern',
 'scanner',
 'search',
 'split',
 'sub',
 'subn']
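A compiled pattern is mostly convenient when the same pattern is applied to many strings. The following is a minimal sketch with made-up sample lines; the names `date_pattern` and `lines` are only for illustration.

```python
import re

# Compile once, then reuse the pattern object for several strings.
date_pattern = re.compile(r"\d{4}-\d{2}-\d{2}$")

lines = [
    "Today is 2018-05-02",
    "No date here",
    "2018-05-02 is the date",
]

# The trailing $ requires the date to be at the end of the string.
ends_with_date = [date_pattern.search(line) is not None for line in lines]
print(ends_with_date)  # [True, False, False]

# The original pattern text is kept on the compiled object.
print(date_pattern.pattern)
```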

## re.sub

In [10]:
re.sub(r"\d{4}-\d{2}-\d{2}$", "YYYY-mm-dd", "Today is 2018-05-02")

'Today is YYYY-mm-dd'

In [2]:
re.sub(r"\s", "", "a b\tc")

'abc'

In [5]:
s = """this is 
    /* BEGIN{NIMA}
    what 
    ever
    END{NIMA} */
    an example
    """
print(re.sub(r"(?s)/\* BEGIN{NIMA}.*END{NIMA} \*/", "", s))

this is 
    
    an example
    


Make sure there is exactly one space after each comma.

In [10]:
re.sub(', *', ', ', "ab,cd")

'ab, cd'

In [11]:
re.sub(', *', ', ', "ab,    cd")

'ab, cd'
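The replacement string passed to `re.sub` can also refer to groups captured by the pattern using `\1`, `\2`, and so on. A small sketch, reusing the date from the earlier example (the target format is an arbitrary choice for illustration):

```python
import re

# \1, \2 and \3 refer to the year, month and day captured by the pattern.
rearranged = re.sub(
    r"(\d{4})-(\d{2})-(\d{2})",
    r"\2/\3/\1",
    "Today is 2018-05-02",
)
print(rearranged)  # Today is 05/02/2018
```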

## re.split

In [16]:
re.split("[+-/*]", "a-b/c*d")

['a', 'b', 'c', 'd']

In [17]:
re.split("[*+-/]", "a-b/c*d")

['a', 'b', 'c', 'd']

In [18]:
re.split("[+*-/]", "a-b/c*d")

['a', 'b', 'c', 'd']

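Inside a `[]` character class, a `-` between two characters denotes a range. The patterns above work because `+-/` and `*-/` are valid ranges that happen to contain `-` (note that they also contain `,` and `.`). `+-*` is not a valid range because `+` comes after `*` in character order, so the pattern below raises an error. To match a literal `-`, put it first or last in the character class or escape it as `\-`.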

In [19]:
re.split("[+-*/]", "a-b/c*d")

error: bad character range +-* at position 1
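The following sketch shows the side effect of an accidental range and two ways to make `-` a literal character in a character class. The input strings are made up for illustration.

```python
import re

# "+-/" inside [] is the range from "+" to "/", which also contains "," and ".".
accidental = re.split(r"[+-/*]", "a,b.c")
print(accidental)  # ['a', 'b', 'c']

# Putting "-" first (or last) makes it a literal minus sign.
literal_first = re.split(r"[-+*/]", "a-b/c*d")
print(literal_first)  # ['a', 'b', 'c', 'd']

# Now "," and "." are no longer treated as delimiters.
untouched = re.split(r"[-+*/]", "a,b.c")
print(untouched)  # ['a,b.c']

# Escaping "-" works at any position in the class.
escaped = re.split(r"[+\-*/]", "a-b/c*d")
print(escaped)  # ['a', 'b', 'c', 'd']
```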

## re.match

In [20]:
re.match(r"^\d{4}-\d{2}-\d{2}$", "2018-07-01")

<re.Match object; span=(0, 10), match='2018-07-01'>

In [21]:
re.match(r"\d{4}-\d{2}-\d{2}", "Today is 2018-07-01.")

## re.search

In [22]:
import re

re.search(r"^\d{4}-\d{2}-\d{2}$", "2018-07-01")

<re.Match object; span=(0, 10), match='2018-07-01'>

In [23]:
import re

re.search(r"\d{4}-\d{2}-\d{2}", "Today is 2018-07-01.")

<re.Match object; span=(9, 19), match='2018-07-01'>
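The difference between `re.match` and `re.search` can be seen side by side in the sketch below, which reuses the sample string from the cells above. `re.fullmatch` is included as well, since it requires the whole string to match.

```python
import re

text = "Today is 2018-07-01."
pattern = r"\d{4}-\d{2}-\d{2}"

# re.match only tries the beginning of the string.
at_start = re.match(pattern, text)
print(at_start)  # None

# re.search scans the whole string.
anywhere = re.search(pattern, text)
print(anywhere.span())  # (9, 19)

# Prefixing the pattern with ^ makes re.search behave like re.match here.
anchored = re.search("^" + pattern, text)
print(anchored)  # None

# re.fullmatch requires the entire string to match the pattern.
full = re.fullmatch(pattern, "2018-07-01")
print(full.group(0))  # 2018-07-01
```

Note that with the flag `re.MULTILINE`, `^` also matches right after each newline, so the equivalence between `re.match` and a `^`-anchored `re.search` only holds without that flag.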

In [6]:
re.search(",", "ab,cd")

<re.Match object; span=(2, 3), match=','>

In [7]:
re.search(r"\b,", "ab,cd")

<re.Match object; span=(2, 3), match=','>

In [8]:
re.search(r"\B,", "ab,cd")

In [None]:
re.search(",", "ab ,cd")

In [None]:
re.search(r"\b,", "ab ,cd")

In [None]:
re.search(r"\B,", "ab ,cd")
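The word-boundary examples above are sensitive to whether the pattern is a raw string. In a regular string literal, `"\b"` is the backspace character rather than the regular expression token `\b`. The sketch below makes the difference visible; the variable names are only for illustration.

```python
import re

# "\b," is backspace + comma (2 characters); r"\b," is backslash + b + comma (3 characters).
lengths = (len("\b,"), len(r"\b,"))
print(lengths)  # (2, 3)

# Without a raw string the pattern looks for a backspace character and finds nothing.
not_raw = re.search("\b,", "ab,cd")
print(not_raw)  # None

# \b matches between "b" (a word character) and "," (a non-word character).
boundary = re.search(r"\b,", "ab,cd")
print(boundary.span())  # (2, 3)

# \B matches only where there is no word boundary.
no_boundary = re.search(r"\B,", "ab,cd")
print(no_boundary)  # None

# Between a space and a comma there is no word boundary.
space_b = re.search(r"\b,", "ab ,cd")
print(space_b)  # None
space_B = re.search(r"\B,", "ab ,cd")
print(space_B.span())  # (3, 4)
```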

## re.Match.group / re.Match.groups

Matched strings in parentheses can be accessed using the method `Match.group` or `Match.groups`.

In [17]:
m = re.search(r"(\d{4}-\d{2}-\d{2})", "Today is 2018-07-01.")
m

<re.Match object; span=(9, 19), match='2018-07-01'>

In [18]:
m.groups()

('2018-07-01',)

In [19]:
m.group(0)

'2018-07-01'
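With more than one pair of parentheses, each group can be accessed by its number, and `group(0)` is always the whole match. The sketch below splits the date into three groups; the named groups `year` and `month` in the second pattern are an illustrative choice, not something used elsewhere in this post.

```python
import re

m = re.search(r"(\d{4})-(\d{2})-(\d{2})", "Today is 2018-07-01.")

print(m.group(0))     # 2018-07-01 (the whole match)
print(m.group(1))     # 2018
print(m.group(2, 3))  # ('07', '01')
print(m.groups())     # ('2018', '07', '01')

# Groups can also be named with (?P<name>...) and accessed by name.
named = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", "Today is 2018-07-01.")
print(named.group("year"))  # 2018
print(named.groupdict())    # {'year': '2018', 'month': '07'}
```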

## re.findall

Find all matched strings.

In [5]:
import re

s = 'It is "a" good "day" today.'
re.findall('".*?"', s)

['"a"', '"day"']

In [12]:
s = """this is 
    /* BEGIN{NIMA}
    what 
    ever
    END{NIMA} */
    an example
    """
re.findall(r"(?s)/\* BEGIN{NIMA}.*END{NIMA} \*/", s)

['/* BEGIN{NIMA}\n    what \n    ever\n    END{NIMA} */']

In [13]:
sql = """
    select ${cal_dt}, ${path} from some_table
    """
re.findall(r"\$\{\w+\}", sql)

['${cal_dt}', '${path}']
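When the positions of the matches are needed in addition to the matched text, `re.finditer` can be used instead of `re.findall`, since it yields `Match` objects. A minimal sketch using the quoted-words string from above:

```python
import re

s = 'It is "a" good "day" today.'

# Each item produced by re.finditer is a Match object.
spans = [(m.group(0), m.span()) for m in re.finditer(r'".*?"', s)]
print(spans)  # [('"a"', (6, 9)), ('"day"', (15, 20))]

# If the pattern contains one group, re.findall returns the group instead of the whole match.
inner = re.findall(r'"(.*?)"', s)
print(inner)  # ['a', 'day']
```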

## The OR Operator |

1. `A|B` matches `A` or `B` where `A` and `B` can be any regular expressions.
    Notice that it is not necessary to put `A` and `B` into parentheses (groups)
    when they are multi-character regular expressions.
    `(A)|(B)` matches the same strings as `A|B` for any (valid) regular expressions,
    although the parentheses additionally create capturing groups.

2. `|` can be used in groups.

In [21]:
re.search("ab|bcd", "abcd")

<re.Match object; span=(0, 2), match='ab'>

In [24]:
re.search("a(b|b)cd", "abcd")

<re.Match object; span=(0, 4), match='abcd'>

In [23]:
re.search("(ab|bc)d", "abcd")

<re.Match object; span=(1, 4), match='bcd'>
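The sketch below illustrates that `|` separates whole alternatives and that adding parentheses changes which groups are captured but not what is matched. The non-capturing group `(?:...)` in the last example is a standard `re` feature that limits the scope of `|` without creating a group; the sample strings are made up.

```python
import re

# The alternatives are "ab" and "bcd", not "a(b|b)cd".
alt = re.search(r"ab|bcd", "xbcd")
print(alt.group(0))  # bcd

# Without parentheses there are no capturing groups.
no_groups = re.search(r"ab|bcd", "abcd")
print(no_groups.groups())  # ()

# With parentheses, the alternative that did not take part in the match is None.
with_groups = re.search(r"(ab)|(bcd)", "abcd")
print(with_groups.groups())  # ('ab', None)

# (?:...) limits the scope of | without creating a capturing group.
scoped = re.search(r"gr(?:a|e)y", "grey")
print(scoped.group(0), scoped.groups())  # grey ()
```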

## Lookahead and Lookbehind

Lookahead and lookbehind provide ways of matching patterns without consuming them in regular expressions.
This is extremely useful when you want to split a string on a delimiter but keep the delimiter.

Split a string into lines but keep the trailing `\n`.

In [1]:
s = "line 1\nline 2\nline 3"
re.split("(?<=\n)", s)

['line 1\n', 'line 2\n', 'line 3']

Split a string into lines but keep `\n` at the beginning of each line.

In [2]:
s = "line 1\nline 2\nline 3"
re.split("(?=\n)", s)

['line 1', '\nline 2', '\nline 3']
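Lookahead and lookbehind are also useful outside of `re.split`, for example to match text only when it appears in a certain context. The sketch below uses made-up strings; splitting on an empty match, as in the last example and in the cells above, requires Python 3.7 or later.

```python
import re

# Lookahead: match digits only when they are followed by " USD",
# without including " USD" in the match.
usd = re.findall(r"\d+(?= USD)", "10 USD, 20 EUR, 30 USD")
print(usd)  # ['10', '30']

# Lookbehind: match digits only when they are preceded by "$".
dollars = re.findall(r"(?<=\$)\d+", "$15 and 25")
print(dollars)  # ['15']

# Split after each comma while keeping the comma.
pieces = re.split(r"(?<=,)", "a,b,c")
print(pieces)  # ['a,', 'b,', 'c']
```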

## Escape & Non-escape

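In Python's `re`, `{` and `}` need not be escaped as long as they do not form a valid quantifier such as `{3}` or `{2,5}` (for example, `BEGIN{NIMA}` above). Escaping them as `\{` and `\}` is always safe and makes the intent explicit.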
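The following sketch shows when braces act as a quantifier and when they are taken literally. `re.escape` is the standard helper for escaping a string that should be matched literally; the sample strings are made up.

```python
import re

# {2} is a quantifier here: the pattern matches "aa".
quantifier = re.search(r"a{2}", "aa")
print(quantifier.group(0))  # aa

# For the same reason it does not match the literal text "a{2}".
not_literal = re.search(r"a{2}", "a{2}")
print(not_literal)  # None

# {y} is not a valid quantifier, so the braces are matched literally.
literal = re.search(r"x{y}", "x{y}")
print(literal.group(0))  # x{y}

# Escaping the braces always matches them literally.
escaped = re.search(r"a\{2\}", "a{2}")
print(escaped.group(0))  # a{2}

# re.escape produces an escaped pattern from a literal string.
auto_escaped = re.escape("a{2}")
print(auto_escaped)  # a\{2\}
```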

## References

- [Precedence of Operators in Regular Expression](http://www.legendu.net/misc/blog/precedence-of-operators-in-regular-expression/)
- [re — Regular expression operations](https://docs.python.org/3/library/re.html)
- [Online Regular Expression Tester](https://regex101.com/)