https://realpython.com/regex-python/

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
import re

re.search(<regex>, <string>) scans <string> looking for the first location where the pattern <regex> matches. If a match is found, then re.search() returns a match object. Otherwise, it returns None.<br>
You’ll always need to import re.search() by one means or another before you’ll be able to use it.

In [3]:
s = "foo123bar"
re.search("123", s)

<re.Match object; span=(3, 6), match='123'>

In [4]:
if re.search("123", s):
    print(1)

1


In [5]:
re.search("[0-9][0-9][0-9]", s)

<re.Match object; span=(3, 6), match='123'>

In [6]:
re.search("[0-9][0-9][0-9]", "12foo34")

In [7]:
re.search("1.3", s) # . matches anything except a newline character

<re.Match object; span=(3, 6), match='123'>

\[\]: specifies a set of characters to match

In [8]:
re.search('ba[artz]', 'foobarqux') # matches ba followed by any one of artz
re.search('ba[artz]', 'foobazqux')
re.search('[a-z]', 'FOObar')
re.search('[A-Z][a-z]', 'FOObar')

<re.Match object; span=(3, 6), match='bar'>

<re.Match object; span=(3, 6), match='baz'>

<re.Match object; span=(3, 4), match='b'>

<re.Match object; span=(2, 4), match='Ob'>

[0-9a-fA-F] matches any hexadecimal digit character.

You can complement a character class by specifying **^** as the first character, in which case it matches any character that isn’t in the set. In the example below, [^0-9] matches any character that isn’t a digit.

If a ^ character appears in a character class but isn’t the first character, then it has no special meaning and matches a literal '^' character.

In [9]:
re.search('[^0-9]', '12345foo') # matches f

<re.Match object; span=(5, 6), match='f'>
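
The hexadecimal class and the literal ^ case described above can be checked the same way. This is a small sketch; the sample strings and variable names are made up for illustration.

```python
import re

# [0-9a-fA-F] matches a single hexadecimal digit; the first one here is 'f'.
hex_match = re.search("[0-9a-fA-F]", "--- f00d ---")
print(hex_match)

# A ^ that is not the first character of the class is a literal caret.
caret_match = re.search("[#:^]", "foo^bar:baz#qux")
print(caret_match)
```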

In [10]:
re.search('[-abc]', '123-456') # a hyphen placed first in the class is a literal '-'

<re.Match object; span=(3, 4), match='-'>

In [11]:
re.search("[9*]", "asdf*asfd98") # inside a character class, * is a literal asterisk

<re.Match object; span=(4, 5), match='*'>

\w matches any alphanumeric word character and is essentially shorthand for [a-zA-Z0-9_]<br>
\W is the opposite. It matches any non-word character and is equivalent to [^a-zA-Z0-9_]<br>
\d matches any decimal digit character. \D is the opposite<br>
\s matches any whitespace character including newline. \S is the opposite<br>
[\d\w\s] matches any digit, word, or whitespace character.
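
A short sketch of the shorthand classes above, run against one sample string. The string and the variable names are assumptions chosen for illustration.

```python
import re

s = "id_42 = 3.14\n"

word = re.search(r"\w+", s)       # first run of word characters
non_word = re.search(r"\W", s)    # first non-word character
digits = re.search(r"\d+", s)     # first run of decimal digits
non_digit = re.search(r"\D", s)   # first non-digit character
space = re.search(r"\s", s)       # first whitespace character
mixed = re.search(r"[\d\s]+", s)  # shorthands also work inside a character class

for m in (word, non_word, digits, non_digit, space, mixed):
    print(m)
```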

In [12]:
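re.search("\\\\", r"foo\bar") # the regex \\ matches one literal backslash; in a non-raw string it is written "\\\\" (r"\\" is equivalent)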

<re.Match object; span=(3, 4), match='\\'>

##### Anchors
Anchors are zero-width matches. They don't match actual characters in the search string but dictate a particular location where a match must occur

^ or \A = the pattern must match at the beginning of the string<br>
$ or \Z = the pattern must match at the end of the string; $ also matches just before a single trailing newline

In [13]:
re.search(r"\Afoo", "barfoo") # no match
re.search(r"\Afoo", "foobar")
re.search("^foo", "foobar")
re.search("^foo", "barfoo") # no match: foo must be at the beginning of the string, not anywhere else

<re.Match object; span=(0, 3), match='foo'>

<re.Match object; span=(0, 3), match='foo'>

In [14]:
re.search(r"bar\Z", "foobar")
re.search("bar$", "foobar")
re.search("$bar", "foobar") # no match
re.search("bar$", "foobar\n") # match because it also matches before a single newline
re.search("bar$", "foobar\n\n") # no match because 2 newlines

<re.Match object; span=(3, 6), match='bar'>

<re.Match object; span=(3, 6), match='bar'>

<re.Match object; span=(3, 6), match='bar'>

\b asserts that the regex parser’s current position must be at the beginning or end of a word. A word consists of a sequence of alphanumeric characters or underscores ([a-zA-Z0-9_]), the same as for the \w character class<br>
\B does the opposite of \b. It asserts that the regex parser’s current position must not be at the start or end of a word
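
A minimal sketch of \b and \B, using made-up sample strings. The patterns are raw strings because "\b" in a regular Python string is a backspace character.

```python
import re

# \b: "foo" must begin and end at a word boundary.
whole_word = re.search(r"\bfoo\b", "a foo b")   # matches the standalone word
inside_word = re.search(r"\bfoo\b", "afoob")    # None: foo is embedded in a longer word

# \B: "foo" must NOT sit at a word boundary on either side.
embedded = re.search(r"\Bfoo\B", "afoob")       # matches foo inside the longer word
standalone = re.search(r"\Bfoo\B", "a foo b")   # None: foo is a word of its own

print(whole_word, inside_word, embedded, standalone, sep="\n")
```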

\* matches zero or more repetitions of the preceding regex<br>
.* matches any sequence of characters except newlines, including the empty string<br>
\+ is similar to \* but the quantified regex must occur at least once

In [15]:
re.search("foo-*bar", "foo--bar")
re.search("foo-*bar", "foobar") # match
re.search("foo-+bar", "foobar") # no match

<re.Match object; span=(0, 8), match='foo--bar'>

<re.Match object; span=(0, 6), match='foobar'>

? matches zero or one repetition of the preceding regex<br>

When used alone, \*, \+ and \? are greedy. The non-greedy versions are \*?, \+? and \??

In [16]:
re.search("foo-?bar", "foobar")
re.search("foo-?bar", "foo-bar")
re.search("foo-?bar", "foo--bar")# no match

<re.Match object; span=(0, 6), match='foobar'>

<re.Match object; span=(0, 7), match='foo-bar'>
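
The greedy versus non-greedy distinction mentioned above is easiest to see side by side. This is a sketch with illustrative strings and variable names.

```python
import re

s = "<foo> <bar>"

greedy_star = re.search("<.*>", s)     # consumes as much as possible
lazy_star = re.search("<.*?>", s)      # stops at the first '>'
lazy_plus = re.search("<.+?>", s)      # at least one character, then as few as possible

greedy_opt = re.search("ba?", "baaa")  # ? prefers one repetition
lazy_opt = re.search("ba??", "baaa")   # ?? prefers zero repetitions

for m in (greedy_star, lazy_star, lazy_plus, greedy_opt, lazy_opt):
    print(m.group())
```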

{m} matches exactly m repetitions of the preceding regex<br>
{m,n} matches any number of repetitions of the preceding regex from m to n inclusive<br>
The non-greedy version is {m,n}?. {m,n} matches as many repetitions as possible, while {m,n}? matches as few as possible

In [19]:
re.search("x-{3}", "x--bar")

In [20]:
re.search("x-{3}", "x---bar")

<re.Match object; span=(0, 4), match='x---'>

In [23]:
re.search("x-{2,4}", "x--bar")

<re.Match object; span=(0, 3), match='x--'>
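
To compare {m,n} with its non-greedy form {m,n}?, here is a small sketch on a string that has more dashes than the upper bound allows. The input string is an assumption for illustration.

```python
import re

s = "x-----"  # five dashes

greedy_range = re.search("x-{2,4}", s)   # takes the maximum of 4 dashes
lazy_range = re.search("x-{2,4}?", s)    # takes the minimum of 2 dashes

print(greedy_range)
print(lazy_range)
```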

### Grouping constructs and backreferences
A group represents a single syntactic entity. Additional metacharacters apply to the entire group as a unit<br>
Some grouping constructs also capture the portion of the search string that matches the subexpression in the group. Captured matches can be retrieved later

bar+ matches bar, barr, barrrrr, etc., while (bar)+ matches one or more consecutive occurrences of bar

In [25]:
re.search("(bar)", "foo bar baz") # a regex in () matches the contents of ()

<re.Match object; span=(4, 7), match='bar'>

In [28]:
re.search("(bar)+", "foo bar baz")
re.search("(bar)+", "foo barbar baz")
re.search("(bar)+", "foo barbarbarbar baz")
re.search("(bar)+", "foo barbar bar baz")

<re.Match object; span=(4, 7), match='bar'>

<re.Match object; span=(4, 10), match='barbar'>

<re.Match object; span=(4, 16), match='barbarbarbar'>

<re.Match object; span=(4, 10), match='barbar'>
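
For contrast with the (bar)+ results above, this sketch runs both bar+ and (bar)+ against the same two strings. Without the parentheses, + applies only to the final r. The sample strings are illustrative.

```python
import re

# + applies only to the 'r' in bar+, but to the whole group in (bar)+.
r_repeated_plain = re.search("bar+", "foo barrr baz")
r_repeated_group = re.search("(bar)+", "foo barrr baz")

bar_repeated_plain = re.search("bar+", "foo barbar baz")
bar_repeated_group = re.search("(bar)+", "foo barbar baz")

print(r_repeated_plain.group(), r_repeated_group.group())
print(bar_repeated_plain.group(), bar_repeated_group.group())
```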

In [32]:
re.search("(ba[rz]){2,4}(qux)?", "bazbarbazqux")
re.search("(ba[rz]){2,4}(qux)?", "barbar") # ? makes qux optional

<re.Match object; span=(0, 12), match='bazbarbazqux'>

<re.Match object; span=(0, 6), match='barbar'>

foo(bar)? is foo optionally followed by bar<br>
(foo(bar)?)+ is one or more occurrences of foo optionally followed by bar<br>
(\d\d\d)? is zero or one occurrence of three decimal digit characters

In [37]:
re.search(r"(foo(bar)?)+(\d\d\d)?", "foofoobar")
# one or more occurrences of foo optionally followed by bar, then optionally 3 decimal digit characters

<re.Match object; span=(0, 9), match='foofoobar'>

In [38]:
m = re.search(r"(foo(bar)?)+(\d\d\d)?", "foofoobar")

In [41]:
m.groups()

('foobar', 'bar', None)

In [49]:
m = re.search(r"(\w+),(\w+),(\w+)", "foo,quux,baz")

In [50]:
m
m.groups()

<re.Match object; span=(0, 12), match='foo,quux,baz'>

('foo', 'quux', 'baz')

In [59]:
m.group(2)
m.group(1) # groups are numbered from 1, not 0
m.group(3, 2, 1)
m.group(3, 2, 2)

'quux'

'foo'

('baz', 'quux', 'foo')

('baz', 'quux', 'quux')

In [57]:
m.group(0) # returns the entire match

'foo,quux,baz'

In r"(\w+),\1", (\w+) captures one or more word characters, the comma is a literal comma, and \1 is a backreference to the first captured group that matches the same text again<br>
A backreference should generally be written in a raw string; otherwise the interpreter may treat \1 as an octal escape. Numbered backreferences are 1-indexed, not 0-indexed

In [65]:
m = re.search(r"(\w+),\1", "foo,foo") # matches a word, followed by a comma followed by the same word again
m
m.group(1)

<re.Match object; span=(0, 7), match='foo,foo'>

'foo'

In [67]:
m = re.search(r"(\w+),\1", "qux,qux")
m
m.group(1)

<re.Match object; span=(0, 7), match='qux,qux'>

'qux'

In [71]:
re.search(r"(\w+),\1", "foo,qux") # doesn't match because the words around the comma are different

In [81]:
### (?P<name><regex>) creates a named capturing group

In [76]:
# "(?P<w1>\w+),(?P<w2>\w+),(?P<w3>\w+)" is the same as "(\w+),(\w+),(\w+)" except that 
# the groups have symbolic names w1,w2,w3

In [80]:
m = re.search(r"(?P<w1>\w+),(?P<w2>\w+),(?P<w3>\w+)", "foo,quux,baz")
m.groups()
m.group("w1", "w3")

('foo', 'quux', 'baz')

('foo', 'baz')

In [82]:
# (?P=<NAME>) is a backreference that refers to a named group rather than a numbered group

In [85]:
m = re.search(r"(\w+),\1", "foo,foo")
m

<re.Match object; span=(0, 7), match='foo,foo'>

In [89]:
m = re.search(r"(?P<num>\d+)\.(?P=num)", "135.135")
m

<re.Match object; span=(0, 7), match='135.135'>

In [98]:
m = re.search(r"(\w+),(?:\w+),(\w+)", "foo,qux,bar")
m # it matches qux but doesn't capture it because of "?:" inside the group
m.groups()
m.group(1,2) # m.group(1,2,3) would raise IndexError because only 2 groups are captured

<re.Match object; span=(0, 11), match='foo,qux,bar'>

('foo', 'bar')

('foo', 'bar')

In [105]:
re.search(r"^(###)?foo(?(1)bar|baz)", "###foobar")

# (?(1)bar|baz) matches against bar if group 1 exists and baz if it doesn't

<re.Match object; span=(0, 9), match='###foobar'>

In [109]:
re.search(r"^(###)?foo(?(1)bar|baz)", "###foobaz") # no match because it matches against bar as group 1 exists

In [110]:
re.search(r"^(###)?foo(?(1)bar|baz)", "foobar") # no match because it matches against baz as group 1 does not exist
# no  ### in the string

In [118]:
re.search(r"^(?P<ch>\W)?foo(?(ch)(?P=ch)|)$", "foo")

# If a non word character is before foo, it creates a group "ch", and then looks for the same match again

re.search(r"^(?P<ch>\W)?foo(?(ch)(?P=ch)|)$", "#foo#")
re.search(r"^(?P<ch>\W)?foo(?(ch)(?P=ch)|)$", "#foo$") # no match

<re.Match object; span=(0, 3), match='foo'>

<re.Match object; span=(0, 5), match='#foo#'>

Generally, it makes for more readable code if we just use multiple re.search instead of conditional regexes
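
As a sketch of that suggestion, the conditional pattern used earlier can be replaced by two plain searches. The helper name `matches_without_conditional` is hypothetical, and the equivalence is only checked for the four sample strings below.

```python
import re

CONDITIONAL = re.compile(r"^(###)?foo(?(1)bar|baz)")

def matches_without_conditional(s):
    """Hypothetical helper: the same check written as two plain searches."""
    return bool(re.search(r"^###foobar", s) or re.search(r"^foobaz", s))

samples = ["###foobar", "###foobaz", "foobar", "foobaz"]

# Map each sample to (conditional regex result, two-search result).
results = {
    s: (bool(CONDITIONAL.search(s)), matches_without_conditional(s))
    for s in samples
}

for s, (conditional, plain) in results.items():
    print(f"{s!r}: conditional={conditional}, plain={plain}")
```

Each alternative is spelled out in full, so a reader does not need to track which group participated in the match.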

### Lookahead and lookbehind assertions
These determine the success or failure of a regex match in Python based on what is just behind or ahead of the parser's current position in the search string. Like anchors, lookahead and lookbehind are zero-width assertions, so they don't consume any of the search string<br>

(?=\<regex\>) is a lookahead and (?!\<regex\>) is a negative lookahead<br>
(?\<=\<regex\>) is a lookbehind and (?\<!\<regex\>) is a negative lookbehind

In [127]:
re.search("foo(?=[a-z])", "foobar") # ?= is a lookahead regex that specifies that what follows "foo" must be a 
# lowercase alphabet character. In this case, it's "b". The portion of lookahead isn't consumed.

# ?=<regex> is lookahead, ?!<regex> is a negative lookahead

<re.Match object; span=(0, 3), match='foo'>

In [125]:
re.search("foo(?=[a-z])", "foo1bar") # no match because 1 isn't [a-z]

In [133]:
re.search("(?<=[a-z])bar", "foobar")
re.search("(?<=[a-z])bar", "1foobar")
re.search("(?<=[a-z])bar", "1foo1bar") # no match

<re.Match object; span=(3, 6), match='bar'>

<re.Match object; span=(4, 7), match='bar'>
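
The negative forms are not exercised in the cells above, so here is a minimal sketch that mirrors the positive examples. The variable names are made up for illustration.

```python
import re

# (?![a-z]): "foo" must NOT be followed by a lowercase letter.
neg_ahead_hit = re.search("foo(?![a-z])", "foo1bar")    # '1' follows foo, so it matches
neg_ahead_miss = re.search("foo(?![a-z])", "foobar")    # 'b' follows foo, so no match

# (?<![a-z]): "bar" must NOT be preceded by a lowercase letter.
neg_behind_hit = re.search("(?<![a-z])bar", "foo1bar")  # '1' precedes bar, so it matches
neg_behind_miss = re.search("(?<![a-z])bar", "foobar")  # 'o' precedes bar, so no match

print(neg_ahead_hit, neg_ahead_miss, neg_behind_hit, neg_behind_miss, sep="\n")
```

As with the positive assertions, the character that is checked is not part of the match, so the spans cover only foo or bar.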