In [1]:
# Matching using the 'in' operator
s = "foo123bar"
"123" in s # True if substring is present

True

In [2]:
# Matching using string functions
s = "foo123bar"
s.find("123") # returns the start index of the substring if present

3

The cases above are simple, and the 'in' operator and string methods handle them well. More complex cases would force us to write verbose, unnecessary code.
Regex comes in handy for solving such cases. No need to reinvent the wheel.
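
To make the point concrete, here is a minimal sketch comparing a hand-written check for "three consecutive decimal digits" with the regex equivalent. The helper names `has_three_digits_manual` and `has_three_digits_regex` are made up for this illustration.

```python
import re

def has_three_digits_manual(text):
    """Return True if text contains three consecutive decimal digits (no regex)."""
    run = 0
    for ch in text:
        # extend the current run of digits, or reset it on a non-digit
        run = run + 1 if ch in "0123456789" else 0
        if run == 3:
            return True
    return False

def has_three_digits_regex(text):
    """Same check expressed as a regex pattern."""
    return re.search("[0-9][0-9][0-9]", text) is not None

samples = ["foo123bar", "12foo46bar"]
manual = [has_three_digits_manual(t) for t in samples]
regex = [has_three_digits_regex(t) for t in samples]
print(manual, regex)
```

Both versions agree on these samples, but the regex version states the pattern in a single line.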

# The re Module

`re` is a standard library module that provides regex functionality.

## `re.search(<regex>, <string>)`

The search function scans a string for the first location where the regex matches. It returns a match object if a match is found, or None if not.
It is the simplest function in the re module to start with.

### Example: Simple cases that also work with the 'in' operator and string functions


In [3]:
# character by character match
import re
s = 'foo123bar'
re.search("123", s) # returns a match object if substring is found otherwise None

<re.Match object; span=(3, 6), match='123'>

In [4]:
# match object is truthy, so it can be used in a Boolean context like with conditionals
if re.search("123", s):
    print("Found a match.")
else:
    print("No match found.")

Found a match.


In [5]:
# None is falsy, so a failed search takes the else branch
if re.search("1224", s):
    print("Found a match")
else:
    print("No match found")

No match found
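
Besides being truthy, a match object carries details about the match. The short sketch below uses the standard match object methods `group()`, `span()`, `start()` and `end()` on the same example string.

```python
import re

m = re.search("123", "foo123bar")
matched_text = m.group()   # the substring that matched
start, end = m.span()      # same as (m.start(), m.end())
print(matched_text, start, end)
```

The span is half-open, so `"foo123bar"[start:end]` gives back the matched text.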


## Regex Metacharacters
We can go a step further with regex: instead of matching character by character, we can use metacharacters to create flexible and 
generic patterns to be matched in the input strings. Applications include input validation, where we define a generic pattern for email addresses, phone numbers, etc. that user input must follow.

All regex metacharacters:
1. Dot (.) - matches any single character except a newline.
2. Caret (^) - anchors a match at the beginning of a string; also complements a character class when it is the first character inside square brackets.
3. Dollar ($) - anchors a match at the end of a string.
4. Asterisk (*) - matches zero or more repetitions.
5. Plus (+) - matches one or more repetitions.
6. Question mark (?) - matches zero or one repetition; also forms the non-greedy versions of *, + and ? (*?, +?, ??).
7. Curly braces ({}) - match an explicitly specified number of repetitions given between the braces.
8. Backslash (\) - escapes a metacharacter so that it is matched literally; also introduces special sequences such as \d and \w.
9. Square brackets ([]) - define a character class.
10. Pipe (|) - separates alternatives on which to match.
11. Parentheses () - create a group.
12. :, #, =, ! - used after (? to create specialized groups.
13. <> - used in (?P<name>...) to create a named group.

## Regex metacharacters that match a single character

### 1.  Square brackets ([])
-  Square brackets [] define a character class. A character class matches any single character that belongs to the class.

In [6]:
s = 'foo123bar'
re.search('[0-9][0-9][0-9]', s) # generic pattern to match any three consecutive decimal digits

<re.Match object; span=(3, 6), match='123'>

In [7]:
# works for any other string to find any three consecutive decimal digits
s = 'foo456bar'
re.search('[0-9][0-9][0-9]', s) # [0-9] is a range that matches a single character on the 0 to 9 range. 

<re.Match object; span=(3, 6), match='456'>

In [8]:
# None is returned if the string doesn't contain the pattern
s = '12foo46bar'
print(re.search('[0-9][0-9][0-9]', s)) 

None


In [9]:
# alphabet range
s = 'Goodluck'
re.search('Good[a-z][a-z][a-z][a-z]', s) 

<re.Match object; span=(0, 8), match='Goodluck'>

In [10]:
# specific character match
# The pattern finds whether the name has a single character which is 'k', 'c' or 'q' anywhere in the string
s = "kelvin"
re.search("[kcq]",s) 

<re.Match object; span=(0, 1), match='k'>

In [11]:
# matches the first (leftmost) occurrence of any of the characters in the character class.
s = "celvink"
re.search("[kcq]",s) 

<re.Match object; span=(0, 1), match='c'>

In [12]:
# multiple-range character class.
s = "---a0---"
re.search("[0-9a-fA-F]",s) # matches any hexadecimal digit character in the string

<re.Match object; span=(3, 4), match='a'>

As observed, this would be difficult when using the 'in' operator or string methods.

We can go further to complement a character class with other metacharacters

## Complementing with caret ^

    The caret (`^`) negates a character class, making it match anything *except* the specified characters.

In [13]:
# Complementing a character class with caret ^
re.search('[^0-9]', '12345foo') # matches any character that is not a decimal digit 

<re.Match object; span=(5, 6), match='f'>

In [14]:
# if the caret appears anywhere in a character class other than as the first character, it has no special meaning and is taken literally.
re.search('[#:^]', 'foo^bar:baz#qux') # the literal caret is found as the 4th character of the string (index 3).

<re.Match object; span=(3, 4), match='^'>

### Metacharacters lose meaning inside a character class

Metacharacters placed inside a character class have no special meaning and only match as themselves as literal characters.

In [15]:
re.search('[)*+|]', '123*456') # '*' is found in the string as literal at index 3

<re.Match object; span=(3, 4), match='*'>

In [16]:
re.search('[)*+|]', '123+456') # '+' is found in the string as literal at index 3

<re.Match object; span=(3, 4), match='+'>

In [17]:
re.search('[)*^|]', '123+^456') # '^' is found in the string as a literal at index 4 ('+' is not in this class)

<re.Match object; span=(4, 5), match='^'>

### 2.  dot (.) (aka  wildcard)
-  The dot matches any single character except a newline.

In [18]:
s = 'foo123bar'
re.search('1.3', s) # matches because there is a character between 1 & 3 that is not a newline

<re.Match object; span=(3, 6), match='123'>

In [19]:
s = 'foo13bar'
print(re.search('1.3', s)) # return None because there is no character between 1 & 3 in the string

None


In [20]:
s = 'foo1\n3bar'
print(re.search('1.3', s)) # returns None because the character between 1 & 3 is a newline
# we can force the dot to match a newline with the re.DOTALL flag if we need to

None
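
As a small illustration of forcing the dot to match a newline, the sketch below repeats the search with the standard `re.DOTALL` flag (also spelled `re.S`) passed as the third argument.

```python
import re

s = 'foo1\n3bar'
default_match = re.search('1.3', s)            # '.' refuses to match '\n'
dotall_match = re.search('1.3', s, re.DOTALL)  # '.' now matches '\n' too
print(default_match)
print(dotall_match.span())
```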


### 3. \w  (lowercase 'w')
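   Matches any alphanumeric word character. 
   
   These include uppercase and lowercase letters, digits and the underscore.
   
   For ASCII text this is equivalent to the character class [a-zA-Z0-9_]. With str patterns, \w also matches non-ASCII letters and digits unless the re.ASCII flag is used.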

In [21]:
# using \w
s = "#(.a$@&"  # named 's' to avoid shadowing the built-in str
re.search(r'\w', s)

<re.Match object; span=(3, 4), match='a'>

In [22]:
# same result if we use a character class
s = "#(.a$@&"
re.search('[a-zA-Z0-9_]', s)

<re.Match object; span=(3, 4), match='a'>
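
The equivalence between `\w` and `[a-zA-Z0-9_]` holds for ASCII input. The sketch below shows the difference on a non-ASCII letter, using the standard `re.ASCII` flag; the sample string is made up for this illustration.

```python
import re

text = "#(\u00e9)"  # '\u00e9' is the letter e with an acute accent
unicode_match = re.search(r'\w', text)           # default for str patterns: Unicode-aware
ascii_match = re.search(r'\w', text, re.ASCII)   # restricts \w to [a-zA-Z0-9_]
class_match = re.search('[a-zA-Z0-9_]', text)    # explicit ASCII character class
print(unicode_match)
print(ascii_match)
print(class_match)
```

If only ASCII word characters should count, either pass `re.ASCII` or spell out the character class.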

### 4. \W (uppercase 'W')

    Matches any non-word character; for ASCII text, equivalent to the [^a-zA-Z0-9_] character class


In [23]:
# using \W
re.search(r'\W', 'a_1*3Qb') # find '*' at index 3 because it is non-word character

<re.Match object; span=(3, 4), match='*'>

In [24]:
# using [^a-zA-Z0-9_] character class
re.search('[^a-zA-Z0-9_]', 'a_1*3Qb') # find '*' at index 3 because it is non-word character

<re.Match object; span=(3, 4), match='*'>

### 5. \d
    Matches any decimal digit character; for ASCII text, equivalent to the [0-9] character class


In [25]:
re.search(r"\d", "Kelvin6") # Finds '6' at index 6

<re.Match object; span=(6, 7), match='6'>

### 6. \D

    Matches any character that is not a decimal digit; for ASCII text, equivalent to the [^0-9] character class

In [26]:
re.search(r"\D", "Kelvin6")  # Finds 'K' at index 0, the leftmost non-decimal digit character

<re.Match object; span=(0, 1), match='K'>

### 7. \s
   Matches any single whitespace character, such as a space, tab or newline.

In [27]:
re.search(r'\s', 'foo\nbar baz') # unlike the dot (.), \s matches a newline because a newline is whitespace

<re.Match object; span=(3, 4), match='\n'>

### 8. \S

    Matches any character except whitespace.

In [28]:
re.search(r'\S', '  \n foo  \n  ') # matches 'f' because it is the leftmost non-whitespace character

<re.Match object; span=(4, 5), match='f'>

### 9. Using \w, \W, \d, \D, \s  and \S in a character class

    These sequences keep the meaning discussed above when used inside a character class.

In [29]:
re.search(r'[\d\w\s]', '---3---') # matches single digit or word or whitespace character

<re.Match object; span=(3, 4), match='3'>

In [30]:
re.search(r'[\d\w\s]', '---a---') # matches single digit or word or whitespace character

<re.Match object; span=(3, 4), match='a'>

In [31]:
re.search(r'[\d\w\s]', '------') # no digit, word or whitespace character is present, so the result is None (no output)

In [32]:
# can be shortened, because \w already includes everything \d matches
re.search(r'[\w\s]', '---3---') # matches single digit or word or whitespace character

<re.Match object; span=(3, 4), match='3'>

### Escaping metacharacters

Sometimes we want metacharacters to match as literal characters rather than have their special meaning.

In [33]:
# Escaping character

# 1. Escape hyphen
# inside a character class, a literal hyphen should be the first or last character, or be escaped with a backslash
# define hyphen as first character
re.search('[-abc]', '123-456') # match found at index 3

<re.Match object; span=(3, 4), match='-'>

In [34]:
# define hyphen as last character
re.search('[abc-]', '123-456') # match found at index 3

<re.Match object; span=(3, 4), match='-'>

In [35]:
# escape hyphen using '\' in the pattern
re.search(r'[ab\-c]', '123-456') # match found at index 3

<re.Match object; span=(3, 4), match='-'>

In [36]:
# 2. Escape square brackets
# escape '[' with a backslash inside the character class
# (a ']' placed first in the class, as in '[]o]', is also taken literally)
re.search(r'[\[o]', 'f[oo1]') 

<re.Match object; span=(1, 2), match='['>

In [37]:
# escape ']' with a backslash
re.search(r'[o\]]', 'f]oo1]') 

<re.Match object; span=(1, 2), match=']'>

In [38]:
# 3. Escaping the wildcard (.)

# here it matches any character that is not a newline
re.search('.', 'foo.bar') # not escaped here: finds 'f', the leftmost character that is not a newline

<re.Match object; span=(0, 1), match='f'>

In [39]:
# Escaping the wildcard (.) with backslash
re.search(r'\.', 'foo.bar') # Here wildcard is escaped, it is matched as literal character in the input string

<re.Match object; span=(3, 4), match='.'>

In [40]:
# 4. Escaping backslash 
# pass the regex as a raw Python string, then escape the backslash with another one so the regex parser matches a single backslash in the input string
s = r"Kelvin\Macharia"
print(s)
re.search(r"\\", s)

Kelvin\Macharia


<re.Match object; span=(6, 7), match='\\'>

In [41]:
# match two consecutive backslashes in an input string
s = r"Kelvin\\Macharia"
re.search(r"\\\\", s)

<re.Match object; span=(6, 8), match='\\\\'>
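
When a whole string should be matched literally, escaping each metacharacter by hand gets tedious. A minimal sketch using the standard `re.escape()` function is shown below; the sample text is made up for this illustration.

```python
import re

literal = "1+1=2 (approx.)"
haystack = "claim: 1+1=2 (approx.) holds"

unescaped = re.search(literal, haystack)   # '+', '(', ')' and '.' act as metacharacters
pattern = re.escape(literal)               # backslash-escapes the metacharacters
escaped = re.search(pattern, haystack)

print(unescaped)
print(escaped.group())
```

This is mainly useful when the text to search for comes from user input or another variable rather than being typed into the pattern.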

## Anchors

    Anchors specify the location in the search string where the match must occur. They match a position and do not consume any characters.

### 1. Caret (^) anchor: anchors the match at the beginning of the input string. 

In [42]:
re.search('^foo', 'foobar') # substring 'foo' is at the beginning of the string so it is found 

<re.Match object; span=(0, 3), match='foo'>

In [43]:
print(re.search('^foo', 'barfoo')) # substring 'foo' is not at the start of the string so no match.

None


In [44]:
# \A also anchors at the beginning of the string; unlike ^, its meaning does not change in MULTILINE mode
re.search(r'\Afoo', 'foobar') # substring 'foo' is at the beginning of the string so it is found 

<re.Match object; span=(0, 3), match='foo'>

In [45]:
print(re.search(r'\Afoo', 'barfoo')) # substring 'foo' is not at the start of the string so no match.

None


### 2. Dollar ($)
    Anchors the match at the end of the string (or just before a newline at the very end of the string)

In [46]:
re.search('bar$', 'foobar')

<re.Match object; span=(3, 6), match='bar'>

In [47]:
re.search(r'bar\Z', 'foobar') # \Z works too for this purpose

<re.Match object; span=(3, 6), match='bar'>

In [48]:
re.search('bar$', 'foobar\n') # '$' also matches just before a single trailing newline

<re.Match object; span=(3, 6), match='bar'>

In [49]:
print(re.search(r'bar\Z', 'foobar\n')) # \Z fails because technically 'bar' is not at the end of the string; a newline character follows it
# \Z is strict

None


### 3. \b
    Anchors a match to a word boundary.

    Note: word characters are [a-zA-Z0-9_]

In [50]:
re.search(r"\bbar", "foo bar") # bar is at the beginning of a new word

<re.Match object; span=(4, 7), match='bar'>

In [51]:
re.search(r"\bbar", "foo.bar") # bar is at the beginning of a new word

<re.Match object; span=(4, 7), match='bar'>

In [52]:
print(re.search(r"\bbar", "foobar")) # bar is not a separate word in this case

None


In [53]:
# \b placed after a word: anchors at the end of 'foo'
re.search(r"foo\b", "foo bar")  # foo ends then a new word begins, there is a clear boundary

<re.Match object; span=(0, 3), match='foo'>

In [54]:
re.search(r"foo\b", "foo.bar") # foo ends then a new word begins, there is a clear boundary

<re.Match object; span=(0, 3), match='foo'>

In [55]:
print(re.search(r"foo\b", "foobar")) # no boundary: 'foo' is immediately followed by the word character 'b'

None


In [56]:
# Word boundaries exist before and after "bar", so the match is valid because space is a non-word character
re.search(r'\bbar\b', 'foo bar baz')

<re.Match object; span=(4, 7), match='bar'>

In [57]:
# There is a word boundary after "bar" but not before it, because '_' is a word character, so there is no match
print(re.search(r'\bbar\b', '_bar foo baz'))

None


In [58]:
# Word boundaries exist before and after "bar", so the match is valid because '-'  is a non-word character
re.search(r'\bbar\b', '-bar foo baz') 

<re.Match object; span=(1, 4), match='bar'>

### 4. \B
    Anchors to a location that isn't a word boundary. Opposite of \b


In [59]:
# No match because 'foo' has a word boundary on both sides
print(re.search(r'\Bfoo\B' , 'foo')) 

None


In [60]:
# There is a match because there is no word boundary on either side of 'foo'
print(re.search(r'\Bfoo\B' , 'barfoobaz')) 

<re.Match object; span=(3, 6), match='foo'>


## Quantifiers

    Quantifiers appear after a portion of <regex> to indicate repetition of that particular portion.

### 1.   Asterisk (*)
        '*' matches zero or more repetitions of the preceding regex

In [61]:
re.search('foo-*bar', 'foobar') # hyphen not present, positive match

<re.Match object; span=(0, 6), match='foobar'>

In [62]:
re.search('foo-*bar', 'foo-bar') # one hyphen present, positive match

<re.Match object; span=(0, 7), match='foo-bar'>

In [63]:
re.search('foo-*bar', 'foo--bar') # 2 hyphens present, still a match. 

<re.Match object; span=(0, 8), match='foo--bar'>

### 2. Plus (+)
        Matches one or more repetitions of the preceding regex

In [64]:
print(re.search('foo-+bar','foobar')) # at least one hyphen must be present for a match

None


In [65]:
re.search('foo-+bar','foo-bar') # one hyphen is present in the string so it's a match

<re.Match object; span=(0, 7), match='foo-bar'>

In [66]:
re.search('foo-+bar','foo--bar') # two hyphens are present in the string, so it's a match; more hyphens would still match

<re.Match object; span=(0, 8), match='foo--bar'>

### 3. Question mark(?)
    Matches zero or one repetition of the preceding regex

In [67]:
re.search('foo-?bar','foobar') # matches when there is no hyphen

<re.Match object; span=(0, 6), match='foobar'>

In [68]:
re.search('foo-?bar','foo-bar') # matches when there is exactly one hyphen

<re.Match object; span=(0, 7), match='foo-bar'>

In [69]:
print(re.search('foo-?bar','foo--bar')) # no match if there is more than one hyphen; there can only be one or none

None


### 4.  Quantifier metacharacters combined with other metacharacters.

        Quantifiers can be combined with other metacharacters in a regex for more refined matches

In [70]:
# a.  Asterisk combined with dot ( .* )
#        Matches zero or more occurrences of any character except a newline

# any number of characters (including zero) may appear between foo and bar, making it a positive match
re.search('foo.*bar', '# foo $qux@grault % bar #') 

<re.Match object; span=(2, 23), match='foo $qux@grault % bar'>

In [71]:
# b.  Asterisk combined with character class ( [range]* )
# (re.match() works like re.search(), but it only matches at the beginning of the string)
re.match('foo[0-9]*bar', 'foobar') # no decimal digits between foo and bar will be a match

<re.Match object; span=(0, 6), match='foobar'>

In [72]:
re.match('foo[0-9]*bar', 'foo1bar') # one decimal digit between foo and bar will be a match

<re.Match object; span=(0, 7), match='foo1bar'>

In [73]:
re.match('foo[0-9]*bar', 'foo123bar') # two or more decimal digits between foo and bar will be a match still

<re.Match object; span=(0, 9), match='foo123bar'>

In [74]:
# c.  Plus combined with character class ( [range]+ )
print(re.match('foo[0-9]+bar', 'foobar')) # no decimal digits between foo and bar, so not a match. There must be at least one

None


In [75]:
re.match('foo[0-9]+bar', 'foo1bar') # at least one decimal digit must be present for a match

<re.Match object; span=(0, 7), match='foo1bar'>

In [76]:
re.match('foo[0-9]+bar', 'foo1234bar') # more than one decimal digit is still a match

<re.Match object; span=(0, 10), match='foo1234bar'>

In [77]:
# d.  Question mark combined with character class ( [range]? )
re.match('foo[0-9]?bar', 'foobar') # zero decimal digit is a match

<re.Match object; span=(0, 6), match='foobar'>

In [78]:
re.match('foo[0-9]?bar', 'foo1bar') # one decimal digit is a match

<re.Match object; span=(0, 7), match='foo1bar'>

In [79]:
print(re.match('foo[0-9]?bar', 'foo12bar')) # more than one decimal digit is not a match. 
#  '?' matches zero or only one occurrence of the regex portion specified

None


### 5. *, + and ? non-greedy(lazy) versions

    By default, the quantifiers *, +, and ? in regular expressions are greedy — they match as much text as possible while still allowing the overall pattern to succeed.

Greedy Quantifiers:

'*' – Matches zero or more of the preceding element (greedy).

'+' – Matches one or more of the preceding element (greedy).

'?' – Matches zero or one of the preceding element (greedy).

Lazy (Non-Greedy) Quantifiers:

To make these quantifiers lazy, append a ? to them. This tells the regex engine to match as little text as possible while still allowing the overall pattern to succeed.

*? – Matches zero or more (lazy).

+? – Matches one or more (lazy).

?? – Matches zero or one (lazy).

Summary:
Greedy tries to match the longest possible substring.

Lazy tries to match the shortest possible substring.

In [80]:
# a. '.*?' - match any character zero or more times in a non-greedy way

# greedy: the longest possible match is returned even though '>' occurs in several locations in the input string
re.search('<.*>', '%<foo> <bar> <baz>%') 

<re.Match object; span=(1, 18), match='<foo> <bar> <baz>'>

In [81]:
# the .*? part (non-greedy) matches the smallest possible substring, up to the first occurrence of '>'
re.search('<.*?>', '%<foo> <bar> <baz>%')

<re.Match object; span=(1, 6), match='<foo>'>

In [82]:
# equivalent non-greedy match using negated character class
re.search('<[^>]*>', '%<foo> <bar> <baz>%')

<re.Match object; span=(1, 6), match='<foo>'>

In [83]:
# b. '.+?' - match any character one or more times in a non-greedy way.

# Example 1
# greedy match
re.search('<.+>', '%<foo> <bar> <baz>%')

<re.Match object; span=(1, 18), match='<foo> <bar> <baz>'>

In [84]:
# non-greedy match
re.search('<.+?>', '%<foo> <bar> <baz>%')

<re.Match object; span=(1, 6), match='<foo>'>

In [85]:
# non-greedy match
# nothing is found between '<' and the first occurrence of '>', therefore the match extends to the next '>' ... 
# ... to ensure at least one character is between '<' and '>'
re.search('<.+?>', '%<> <bar> <baz>%') # one or more characters must be present between '<' and '>'

<re.Match object; span=(1, 9), match='<> <bar>'>

In [86]:
# Example 2
re.search('ba?', 'ba') # greedy version matches the longest possible, which is one 'a'

<re.Match object; span=(0, 2), match='ba'>

In [87]:
re.search('ba??', 'baaaa') # non-greedy: returns the shortest match, i.e. zero 'a', so only 'b' is returned

<re.Match object; span=(0, 1), match='b'>
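
A common practical use of the lazy versions is pulling out the first delimited item from a string. The sketch below, on a made-up sample string, contrasts the greedy and lazy forms when extracting a double-quoted phrase.

```python
import re

text = 'say "hi" and then "bye"'
greedy = re.search(r'".*"', text).group()   # runs to the last double quote
lazy = re.search(r'".*?"', text).group()    # stops at the first closing double quote
print(greedy)
print(lazy)
```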

### 6. {m}
    Specifies an exact number of repetitions: the preceding regex must occur exactly m times

In [88]:
print(re.search('x-{3}x', 'x--x')) # exactly three dashes required for a positive match

None


In [89]:
re.search('x-{3}x', 'x---x') # three dashes are present, so the string is a positive match

<re.Match object; span=(0, 5), match='x---x'>

In [90]:
print(re.search('x-{3}x', 'x----x')) # exactly three dashes not less not more. It is specific

None


### 7. {m,n}
    Specifies a range of repetitions: at least m and at most n, both inclusive.

In [91]:
# strings with 2, 3 and 4 dashes are matched
for i in range(1,6):
    s = f"x{'-'*i}x"
    print(f'{i} {s:10}', re.search('x-{2,4}x', s))

1 x-x        None
2 x--x       <re.Match object; span=(0, 4), match='x--x'>
3 x---x      <re.Match object; span=(0, 5), match='x---x'>
4 x----x     <re.Match object; span=(0, 6), match='x----x'>
5 x-----x    None


### Other ways of specifying reps are:

`<regex>` {,n}  - means `<regex>` {0,n}

`<regex>` {m,}  - means `<regex>` {m, $\infty$}

`<regex>` {,}  - means any number of reps, equivalent to `<regex>*`

*Note: Empty curly braces have no special meaning and will be taken as literal characters*

In [92]:
re.search("k{}n", "k{}n") # matches the string "k{}n"

<re.Match object; span=(0, 4), match='k{}n'>

### 8. {m,n}?

    The non-greedy version of {m,n}: it matches as few repetitions as possible, i.e. only m when that allows the overall match to succeed
    

In [93]:
# greedy match, matches the maximum of 5
re.search('a{3,5}', 'aaaaaaaa')

<re.Match object; span=(0, 5), match='aaaaa'>

In [94]:
# non-greedy match, matches the minimum of 3
re.search('a{3,5}?', 'aaaaaaaa')

<re.Match object; span=(0, 3), match='aaa'>
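
Putting anchors, `\d` and `{m}` together gives a simple input validator of the kind mentioned earlier. The sketch below is only an illustration: the `NNN-NNN-NNNN` phone format and the helper name `looks_like_phone` are assumptions, not a real-world phone number standard.

```python
import re

# hypothetical format: 3 digits, 3 digits, 4 digits, separated by hyphens
PHONE = r'^\d{3}-\d{3}-\d{4}\Z'

def looks_like_phone(text):
    """Return True if the whole string has the assumed NNN-NNN-NNNN shape."""
    return re.search(PHONE, text) is not None

candidates = ["555-123-4567", "555-1234-567", "555-123-4567\n", "call 555-123-4567"]
results = [looks_like_phone(c) for c in candidates]
print(results)
```

`\Z` is used instead of `$` so that a trailing newline is rejected, and `^` rejects any leading text.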

## Grouping Constructs

Grouping constructs break a regex into subexpressions (groups) to allow:

1. Grouping - A group acts as a single entity, so additional metacharacters such as quantifiers apply to the group as a unit.
2. Capturing and retrieval - A grouping construct captures the part of the search string that it matched, allowing it to be retrieved for further processing.

In [95]:
# normal regex
re.search('bar', 'foo bar baz')

<re.Match object; span=(4, 7), match='bar'>

In [96]:
# a group matches the same way
re.search('(bar)', 'foo bar baz')

<re.Match object; span=(4, 7), match='bar'>

In [97]:
# using quantifiers on groups, regex treated as a unit
re.search('(bar)+', 'foo bar baz')

<re.Match object; span=(4, 7), match='bar'>

In [98]:
# using quantifiers on groups, the regex is treated as a unit
re.search('(bar)+', 'foo barbar baz')

<re.Match object; span=(4, 10), match='barbar'>

In [99]:
# using quantifiers on groups, the regex is treated as a unit
re.search('(bar)+', 'foo barbarbarbar baz')

<re.Match object; span=(4, 16), match='barbarbarbar'>

In [100]:
# without grouping, the quantifier applies only to the preceding character
re.search('bar+', 'foo bar baz')

<re.Match object; span=(4, 7), match='bar'>

In [101]:
# without grouping, the quantifier applies only to the preceding character
re.search('bar+', 'foo barbar baz') # repetition of the word 'bar' is not matched because 'bar' is not grouped

<re.Match object; span=(4, 7), match='bar'>

In [102]:
# without grouping, the quantifier applies only to the preceding character
re.search('bar+', 'foo barr baz')

<re.Match object; span=(4, 8), match='barr'>

In [103]:
# without grouping, the quantifier applies only to the preceding character
re.search('bar+', 'foo barrr baz')

<re.Match object; span=(4, 9), match='barrr'>

### Complex grouping examples

    Simple grouping

In [104]:
re.search('(ba[rz]){2,4}(qux)?', 'bazbarbazbarqux')

<re.Match object; span=(0, 15), match='bazbarbazbarqux'>

In [105]:
# qux is optional
re.search('(ba[rz]){2,4}(qux)?', 'barbar')

<re.Match object; span=(0, 6), match='barbar'>

    Nested grouping

In [106]:
re.search(r'(foo(bar)?)+(\d\d\d)?', 'foofoobar')

<re.Match object; span=(0, 9), match='foofoobar'>

In [107]:
re.search(r'(foo(bar)?)+(\d\d\d)?', 'foofoobar123')

<re.Match object; span=(0, 12), match='foofoobar123'>

In [108]:
re.search(r'(foo(bar)?)+(\d\d\d)?', 'foofoo123')

<re.Match object; span=(0, 9), match='foofoo123'>

### Capturing groups

We can capture part of the search string that matched the group.

re.search() returns a match object or None. The match object has several methods that can be used to retrieve the information required.

These methods include:

#### 1. m.groups():
   
    Returns a tuple containing all captured groups from the regex match

In [109]:
# save object in a variable
m = re.search(r'(\w+),(\w+),(\w+)', 'foo,quux,baz')
m

<re.Match object; span=(0, 12), match='foo,quux,baz'>

In [110]:
# retrieve groups
m.groups() # returns a tuple of the matched groups; the commas and anything else outside the groups aren't returned

('foo', 'quux', 'baz')

#### 2. m.group(`<n>`):

    Returns a string containing the `<n>`th captured match.
    
    Group numbers are one-based, not zero-based, i.e. the first captured group is group 1. Group 0 refers to the entire match.

In [111]:
# retrieve the nth group.
m.group(1) # group numbers are one-based, so 1 is the first captured group

'foo'

In [112]:
m.group(2) # second captured group

'quux'

In [113]:
m.group(3) # third captured group

'baz'

In [114]:
# m.group() returns a string of the entire match
m.group() 

'foo,quux,baz'

In [115]:
# m.group(0) returns a string of the entire match same as m.group()
m.group(0) 

'foo,quux,baz'

#### 3. m.group(`<n1>, <n2>` ...)

    m.group() with multiple arguments returns a tuple containing the specified captured matches.

In [116]:
# m.groups() returns the entire tuple
m.groups()

('foo', 'quux', 'baz')

In [117]:
# m.group(<n1>, <n2> ...) returns the selected members of m.groups(), in the order requested
m.group(2,3)

('quux', 'baz')

In [118]:
m.group(3,2,1)

('baz', 'quux', 'foo')
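
Two details are worth seeing on the nested pattern from the complex grouping examples: groups are numbered by the position of their opening parenthesis, and a group that did not take part in the match is reported as None. The sketch below also uses the `default` argument of `m.groups()`, which is part of the standard match object API.

```python
import re

pattern = r'(foo(bar)?)+(\d\d\d)?'

m1 = re.search(pattern, 'foofoobar123')
full = m1.groups()            # group 1 holds its last repetition, 'foobar'

m2 = re.search(pattern, 'foofoo123')
partial = m2.groups()         # group 2, (bar)?, never matched
filled = m2.groups(default='')  # replace None with an empty string

print(full)
print(partial)
print(filled)
```

A quantified group such as `(foo(bar)?)+` keeps only the text of its last repetition.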

## Back references

    A backreference `\<n>` matches the same text that was previously captured by group `<n>`.

    Numbered backreferences `\<n>` can only be 1 - 99, meaning only the first ninety-nine (99) captured groups can be accessed.

    Use:
    This is useful when you want to match the same text again without repeating the regex pattern, which reduces redundant code and 
    minimizes errors when defining a regular expression.

In [119]:
# matches 1st instance of 'foo' then \1 is a backreference to the first captured group
re.search(r'(\w+),\1', 'foo,foo')

<re.Match object; span=(0, 7), match='foo,foo'>

In [120]:
re.search(r'(\w+),\1', 'qux,qux')

<re.Match object; span=(0, 7), match='qux,qux'>

In [121]:
print(re.search(r'(\w+),\1', 'foo,qux')) # 'qux' doesn't match the first group

None


In [122]:
# always specify the regex as a raw string, otherwise Python interprets '\1' as an octal escape (the character with code 1) instead of a backreference
print(re.search('([a-z])#\1', 'd#d'))

None


In [123]:
# when regex is a raw string, it works fine
print(re.search(r'([a-z])#\1', 'd#d'))

<re.Match object; span=(0, 3), match='d#d'>


In [124]:
# what Python sees if the regex pattern is not a raw string
oct(ord('\1'))

'0o1'
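
A typical use of a backreference is spotting an accidentally repeated word. The sketch below combines `\b`, a captured group and `\1`; the sample sentence is made up for this illustration.

```python
import re

text = 'this is is a test'
m = re.search(r'\b(\w+)\s+\1\b', text)  # a word, whitespace, then the same word again
repeated_word = m.group(1)
print(m.span(), repeated_word)
```

The leading `\b` keeps the pattern from starting inside a word, so the 'is' at the end of 'this' is not treated as the first half of a pair.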

## Advanced regex grouping concepts

### 1.  Named captured group

    (?P<name><regex>) creates a named captured group.
    
    This is similar to grouping parentheses, but the matched group can be referenced by the name given in the <name> part instead of by a number.

In [125]:
# accessing groups with numbers
m = re.search(r'(\w+),(\w+),(\w+)', 'foo,quux,baz')
print(m)
print(f'm.groups(): {m.groups()}')
print(f'm.group(1,2,3): {m.group(1,2,3)}')

<re.Match object; span=(0, 12), match='foo,quux,baz'>
m.groups(): ('foo', 'quux', 'baz')
m.group(1,2,3): ('foo', 'quux', 'baz')


In [126]:
# name the groups and access them using these names
m = re.search(r'(?P<w1>\w+),(?P<w2>\w+),(?P<w3>\w+)', 'foo,quux,baz')
print(m)
print(f'm.groups(): {m.groups()}')
print(f"m.group('w1','w2','w3'): {m.group('w1','w2','w3')}")
print(f"m.group(1,2,3): {m.group(1,2,3)}") # group numbers still work

<re.Match object; span=(0, 12), match='foo,quux,baz'>
m.groups(): ('foo', 'quux', 'baz')
m.group('w1','w2','w3'): ('foo', 'quux', 'baz')
m.group(1,2,3): ('foo', 'quux', 'baz')


### 2. Backreferencing with captured named group

    (?P=<name>) matches the content of a previously captured named group.

In [127]:
# backreferencing with a named group
m = re.search(r'(?P<word>\w+),(?P=word)', 'foo,foo')
print(m)
m.group('word')

<re.Match object; span=(0, 7), match='foo,foo'>


'foo'

In [128]:
# similar example using group numbers
m = re.search(r'(\w+),\1', 'foo,foo')
print(m)
m.group(1)

<re.Match object; span=(0, 7), match='foo,foo'>


'foo'

### 3. Non-capturing group

    (?:<regex>) creates a non-capturing group, which can neither be retrieved later nor be backreferenced.

    Use and Pros:
    - You may need to capture some groups and not others; non-capturing groups let you declutter the result and keep only what you need.
    -  Saves memory by not capturing what is not needed, which may slightly improve performance.

In [129]:
m =  re.search(r'(\w+),(?:\w+),(\w+)','foo,quux,baz')
print(m)
print(f'groups(): {m.groups()}') # note that the non-capturing group is missing from the tuple

<re.Match object; span=(0, 12), match='foo,quux,baz'>
groups(): ('foo', 'baz')


### 4. Conditional matching

    (?(<n>)<yes-regex>|<no-regex>)
    (?(<name>)<yes-regex>|<no-regex>)

    We can specify conditional matching.
    This matches against <yes-regex> if the group numbered <n> (or named <name>) exists, i.e. took part in the match, and against <no-regex> otherwise.


##### Example using group numbers

In [130]:
regex = r'^(###)?foo(?(1)bar|baz)'
# group 1 exists, so 'bar' must follow 'foo': '###foobar' matches
re.search(regex, '###foobar') 

<re.Match object; span=(0, 9), match='###foobar'>

In [131]:
# group 1 exists, so 'bar' must follow 'foo'; '###foobaz' isn't a match
print(re.search(regex, '###foobaz'))

None


In [132]:
# group 1 doesn't exist, so 'baz' must follow 'foo'; 'foobar' isn't a match
print(re.search(regex, 'foobar'))

None


In [133]:
# group 1 doesn't exist, so 'baz' must follow 'foo': 'foobaz' matches
print(re.search(regex, 'foobaz'))

<re.Match object; span=(0, 6), match='foobaz'>


##### Example using named groups

In [134]:
# match an optional non-word character as the first character of the string and capture it as group 'ch'. 
# Whether a last character is required depends on whether group 'ch' exists

regex = r'^(?P<ch>\W)?foo(?(ch)(?P=ch)|)$' 
# group 'ch' exists, so the same character must follow 'foo' for a positive match 
re.search(regex, '-foo-')

<re.Match object; span=(0, 5), match='-foo-'>

In [135]:
# group 'ch' doesn't exist, so no character may follow 'foo' for a positive match
re.search(regex, 'foo')

<re.Match object; span=(0, 3), match='foo'>

## Lookahead and Lookbehind Assertions

    These determine the success or failure of a regex match based on what is just behind (left of) or ahead (right of) the current position.

### 1. Lookahead assertion

     Uses (?=<lookahead_regex>) to ensure that the regex matches only if the current position is followed by a certain sub-pattern (a sort of condition)

In [136]:
# lookahead assertion
re.search('foo(?=[a-z])','foobar') # asserts that foo must be followed by a lowercase letter; here 'b' in 'bar' is one

<re.Match object; span=(0, 3), match='foo'>

In [137]:
print(re.search('foo(?=[a-z])','foo123')) # asserts that foo must be followed by a lowercase letter; here '1' in '123' is not one

None


Note: The lookahead isn't consumed; it is not part of the returned match.

In [138]:
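m = re.search('foo(?=[a-z])(?P<ch>.)', 'foobar') # the lookahead does not consume 'b', so group 'ch' captures 'b'
print(m.group('ch'))


m = re.search('foo([a-z])(?P<ch>.)', 'foobar') # the plain group ([a-z]) consumes 'b', so group 'ch' captures 'a'
print(m.group('ch'))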

b
a


The lookahead assertion can be negated with:

`(?!<lookahead_regex>)`

In [139]:
re.search('foo(?=[a-z])', 'foobar') # match if lookahead is present

<re.Match object; span=(0, 3), match='foo'>

In [140]:
re.search('foo(?![a-z])', 'foo123') # match if the lookahead is not present

<re.Match object; span=(0, 3), match='foo'>

### 2. Lookbehind assertion

     Uses (?<=<lookbehind_regex>) to ensure that the regex matches only if the current position is preceded by a certain sub-pattern (a sort of condition)

In [141]:
re.search('(?<=foo)bar', 'foobar') # lookbehind present 'foo'

<re.Match object; span=(3, 6), match='bar'>

In [142]:
re.search('(?<=foo)bar', 'fobar') # no match (None): 'bar' is preceded by 'fo', not 'foo'

A lookbehind pattern must have a fixed length

In [143]:
re.search('(?<=a+)def', 'aaadef') # fails because 'a+' can match strings of different lengths

PatternError: look-behind requires fixed-width pattern

In [None]:
re.search('(?<=a{3})def', 'aaadef') # passes, length is specific

Lookbehind can also be negated using:

`(?<!<lookbehind_regex>)`

In [None]:
print(re.search('(?<!foo)bar', 'foobar')) # no match: for a positive match, 'bar' must not be preceded by 'foo'

re.search('(?<!qux)bar', 'foobar')
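
Because lookarounds are not consumed, they are handy for extracting a value without its surrounding marker. The sketch below, on a made-up sample string, pulls out numbers based on what comes before or after them.

```python
import re

text = "price: $45, weight: 12kg"

dollars = re.search(r'(?<=\$)\d+', text)         # digits preceded by '$'
kilos = re.search(r'\d+(?=kg)', text)            # digits followed by 'kg'
unpriced = re.search(r'(?<![$\d])\d+', text)     # digits not preceded by '$' or another digit

print(dollars.group(), kilos.group(), unpriced.group())
```

In each case the marker (`$` or `kg`) decides whether the match succeeds but is left out of the matched text.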

## Miscellaneous Metacharacters

    These are stray metacharacters that do not fit into any of the categories above.

### 1. Comment (?#...)
        Specifies a comment inside a regex pattern for documentation purposes.

In [None]:
re.search(r'bar(?#This is a comment) *baz', 'foo bar baz qux')

### 2. Vertical bar or pipe(|)

    Specifies a set of alternatives; the regex matches if any one of them matches.

In [None]:
re.search('foo|bar|baz', 'bar')

In [None]:
re.search('foo|bar|baz', 'baz')

In [None]:
print(re.search('foo|bar|baz', 'quux'))

In [None]:
print(re.search('foo', 'foograult'))

print(re.search('grault', 'foograult'))

print(re.search('foo|grault', 'foograult')) # alternation is not greedy: the search stops at the first alternative that matches ('foo'), even though a longer one ('grault') matches later
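
Two consequences of this left-to-right behaviour are worth seeing side by side: the order of alternatives matters when one is a prefix of another, and parentheses limit how far the `|` reaches. A minimal sketch, with variable names chosen for illustration:

```python
import re

# Alternatives are tried left to right at each position; the first that succeeds wins.
short_first = re.search('foo|foobar', 'foobar')
long_first = re.search('foobar|foo', 'foobar')
print(short_first.group())
print(long_first.group())

# Without a group, | splits the whole pattern; parentheses limit its scope.
ungrouped = re.search('foo|barbaz', 'foobaz')
grouped = re.search('(foo|bar)baz', 'foobaz')
print(ungrouped.group())
print(grouped.group())
```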

> 💡 **Tip**: To test regex interactively, go to [regex101.com](https://regex101.com/)

## Modifying Regular expression matching with Flags

    Most functions in the re module take an optional <flags> argument that modifies regex parsing behavior.

    re.search(<regex>,<string>,<flags>)

### 1. re.I (re.IGNORECASE)

    Makes matching case insensitive.    

In [None]:
re.search('a+', 'aaaAAA') # match the lowercase a's

In [None]:
re.search('A+', 'aaaAAA') # match the uppercase A's

In [None]:
re.search('a+', 'aaaAAA', re.I) # ignores case

In [None]:
re.search('A+', 'aaaAAA', re.IGNORECASE) # ignores case

### 2. re.M (re.MULTILINE)

    Causes the start (^) and end ($) of string anchors to also match at embedded newlines.
    By default, the anchors match at the beginning and end of the search string.

In [None]:
s = 'foo\nbar\nbaz'
# without the MULTILINE flag, anchors do not match at internal lines
print(re.search('^foo',s)) # matches because foo is at the start of the search string
print(re.search('^bar',s)) # doesn't match because ^ is not recognized at the start of internal lines
print(re.search('^baz',s)) # doesn't match because ^ is not recognized at the start of internal lines
print(f"{'-'*50}")
print(re.search('foo$',s)) # doesn't match because $ is not recognized at the end of internal lines
print(re.search('bar$',s)) # doesn't match because $ is not recognized at the end of internal lines
print(re.search('baz$',s)) # matches because baz is at the end of the search string

In [None]:
s = 'foo\nbar\nbaz'
# with the MULTILINE flag, anchors also match at the start and end of internal lines
print(re.search('^foo',s, re.MULTILINE)) # matches because foo is at the start of the search string
print(re.search('^bar',s, re.MULTILINE)) # matches because bar is at the start of an internal line and MULTILINE is on
print(re.search('^baz',s, re.MULTILINE)) # matches because baz is at the start of an internal line and MULTILINE is on
print(f"{'-'*50}")
print(re.search('foo$',s, re.M)) # matches because foo is at the end of an internal line and MULTILINE is on
print(re.search('bar$',s, re.M)) # matches because bar is at the end of an internal line and MULTILINE is on
print(re.search('baz$',s, re.M)) # matches because baz is at the end of the search string
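
A common use of MULTILINE is to pick something out of every line at once. The sketch below uses `re.findall` (which returns all non-overlapping matches) on the same sample string; the variable names are illustrative.

```python
import re

s = 'foo\nbar\nbaz'

# Without the flag, ^ matches only at the very start of the string.
first_only = re.findall(r'^\w+', s)

# With the flag, ^ and $ also match at the start and end of each line.
every_line = re.findall(r'^\w+', s, re.MULTILINE)
line_ends = re.findall(r'\w$', s, re.M)

print(first_only)
print(every_line)
print(line_ends)
```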

### Strict Anchors
    These are not affected by the MULTILINE flag.
    They do not recognize the internal line boundaries that MULTILINE enables.
    They anchor strictly at the start and end of the search string, never at internal lines.
> a) Strict start of string anchor (\A)
>> Enforces the start of the search string regardless of internal lines and the MULTILINE flag

In [None]:
s = 'foo\nbar\nbaz'
print(re.search('^bar', s, re.MULTILINE)) # bar is at the start of an internal line, so ^ matches it
print(re.search(r'\Abar', s, re.MULTILINE)) # no match: \A enforces the start of the search string, and bar only starts an internal line
print(re.search(r'\Afoo', s, re.MULTILINE)) # foo is at the start of the search string, so \A matches it

> b) Strict end of string anchor (\Z)
>> Enforces the end of the search string regardless of internal lines and the MULTILINE flag

In [None]:
s = 'foo\nbar\nbaz'
print(re.search('bar$', s, re.MULTILINE)) # bar is at the end of an internal line, so $ matches it
print(re.search(r'bar\Z', s, re.MULTILINE)) # no match: \Z enforces the end of the search string, and bar only ends an internal line
print(re.search(r'baz\Z', s, re.MULTILINE)) # baz is at the end of the search string, so \Z matches it
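
The two end anchors also differ without any flag when the string ends with a newline: `$` still matches just before that final newline, while `\Z` does not. A minimal sketch with a made-up sample string:

```python
import re

s = 'foo\nbaz\n'  # note the trailing newline

dollar = re.search(r'baz$', s)        # $ also matches just before a newline that ends the string
strict = re.search(r'baz\Z', s)       # \Z matches only at the very end of the string
strict_nl = re.search(r'baz\n\Z', s)  # consuming the newline first lets \Z succeed

print(dollar)
print(strict)
print(strict_nl)
```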

### 3. re.S (re.DOTALL)
> Makes the dot (.) metacharacter match a newline.
> 
> By default, (.) matches any character except the newline character.

In [None]:
# Matches anything except newline
re.search('.*', 'Kelvin\nMacharia')

In [None]:
# Matches anything, including the newline character, because of re.DOTALL
re.search('.*', 'Kelvin\nMacharia', re.S)

> 💡 **Tip**: The short form of re.DOTALL is re.S, not re.D
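
DOTALL is most useful when the text between two markers may span several lines. The sketch below uses a made-up HTML-like string and a non-greedy `.*?`; the variable names are illustrative.

```python
import re

html = '<p>line one\nline two</p>'

# Without DOTALL, the dot cannot cross the newline, so no match is found.
without_flag = re.search(r'<p>(.*?)</p>', html)

# With DOTALL, the dot also matches the newline and the group spans both lines.
with_flag = re.search(r'<p>(.*?)</p>', html, re.DOTALL)

print(without_flag)
print(repr(with_flag.group(1)))
```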

### 4. re.X (re.VERBOSE)
> Allows inclusion of whitespace and comments within a regex.

> With the VERBOSE flag, the parser behaves a little differently:

>> 1. The regex parser ignores whitespace unless it is within a character class or escaped with a backslash.
>> 2. It also ignores everything from a # to the end of the line, unless the # is within a character class or escaped with a backslash.

In [145]:
# with re.VERBOSE/re.X, whitespace and #-comments are ignored, so the regex pattern can span multiple lines for readability
# When writing big regex patterns, consider re.VERBOSE so that each part of the pattern can be documented on its own line

regex = r'''^               # Start of string
            (\(\d{3}\))?    # Optional area code
            \s*             # Optional whitespace
            \d{3}           # Three-digit prefix
            [-.]            # Separator character
            \d{4}           # Four-digit line number
            $               # Anchor at end of string
            '''
print(re.search(regex, '414.9229')) # without re.VERBOSE, the pattern includes everything, i.e. the whitespace and the text intended as comments
print(re.search(regex, '414.9229', re.X)) # re.X activates verbose mode

None
<re.Match object; span=(0, 8), match='414.9229'>


> 💡 **Tip**: The short form of re.VERBOSE is re.X, not re.V
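
Since verbose mode drops bare whitespace and treats `#` as the start of a comment, a literal space or `#` has to be protected. The sketch below shows one way to do each (a character class for the space, a backslash for the `#`); the sample text is made up.

```python
import re

regex = r'''
    item        # literal text
    [ ]         # a single space, kept because it is inside a character class
    \#          # a literal hash sign, kept because it is escaped
    (\d+)       # the item number
'''

m = re.search(regex, 'see item #42 for details', re.VERBOSE)
print(m.group())
print(m.group(1))
```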

### 5. re.DEBUG (no single-letter short form)
>This displays debugging information.
>
> Information displayed by the DEBUG flag can help you troubleshoot by showing you how the parser is interpreting your regex.

In [147]:
# debug mode prints how the parser interprets the regex, e.g. the character codes of literal characters and
# ... the meaning of metacharacters in the regex.


# simple example
re.search('foo.bar','fooxbar', re.DEBUG)

LITERAL 102
LITERAL 111
LITERAL 111
ANY None
LITERAL 98
LITERAL 97
LITERAL 114

 0. INFO 12 0b1 7 7 (to 13)
      prefix_skip 3
      prefix [0x66, 0x6f, 0x6f] ('foo')
      overlap [0, 0, 0]
13: LITERAL 0x66 ('f')
15. LITERAL 0x6f ('o')
17. LITERAL 0x6f ('o')
19. ANY
20. LITERAL 0x62 ('b')
22. LITERAL 0x61 ('a')
24. LITERAL 0x72 ('r')
26. SUCCESS


<re.Match object; span=(0, 7), match='fooxbar'>

In [149]:
# more complex example with debug mode on
regex = r'^(\(\d{3}\))?\s*\d{3}[-.]\d{4}$'
re.search(regex, '414.9229', re.DEBUG)

AT AT_BEGINNING
MAX_REPEAT 0 1
  SUBPATTERN 1 0 0
    LITERAL 40
    MAX_REPEAT 3 3
      IN
        CATEGORY CATEGORY_DIGIT
    LITERAL 41
MAX_REPEAT 0 MAXREPEAT
  IN
    CATEGORY CATEGORY_SPACE
MAX_REPEAT 3 3
  IN
    CATEGORY CATEGORY_DIGIT
IN
  LITERAL 45
  LITERAL 46
MAX_REPEAT 4 4
  IN
    CATEGORY CATEGORY_DIGIT
AT AT_END

 0. INFO 4 0b0 8 MAXREPEAT (to 5)
 5: AT BEGINNING
 7. REPEAT 21 0 1 (to 29)
11.   MARK 0
13.   LITERAL 0x28 ('(')
15.   REPEAT_ONE 9 3 3 (to 25)
19.     IN 4 (to 24)
21.       CATEGORY UNI_DIGIT
23.       FAILURE
24:     SUCCESS
25:   LITERAL 0x29 (')')
27.   MARK 1
29: MAX_UNTIL
30. REPEAT_ONE 9 0 MAXREPEAT (to 40)
34.   IN 4 (to 39)
36.     CATEGORY UNI_SPACE
38.     FAILURE
39:   SUCCESS
40: REPEAT_ONE 9 3 3 (to 50)
44.   IN 4 (to 49)
46.     CATEGORY UNI_DIGIT
48.     FAILURE
49:   SUCCESS
50: IN 5 (to 56)
52.   RANGE 0x2d 0x2e ('-'-'.')
55.   FAILURE
56: REPEAT_ONE 9 4 4 (to 66)
60.   IN 4 (to 65)
62.     CATEGORY UNI_DIGIT
64.     FAILURE
65:   SUCCESS


<re.Match object; span=(0, 8), match='414.9229'>

>> Note: Unicode matching is the default for string patterns in the Python regex parser (hence UNI_DIGIT and UNI_SPACE above), so classes such as \d and \s handle most languages.

## Other regex Nitty Gritty
### 1. Combining flags

In [150]:
# bitwise OR (|) is used to specify more than one flag
re.search('^bar', 'FOO\nBAR\nBAZ', re.I|re.M) # both IGNORECASE & MULTILINE active

<re.Match object; span=(4, 7), match='BAR'>
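
Combined flags can also be passed to `re.compile`, and the compiled pattern's `flags` attribute can be checked with bitwise AND. This is a small sketch; the pattern and variable names are illustrative.

```python
import re

# Compile once with both flags, then reuse the pattern object.
pattern = re.compile(r'^ba\w', re.IGNORECASE | re.MULTILINE)

found = pattern.findall('FOO\nBAR\nBAZ')
has_ignorecase = bool(pattern.flags & re.IGNORECASE)
has_dotall = bool(pattern.flags & re.DOTALL)

print(found)
print(has_ignorecase, has_dotall)
```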

### 2. Setting flags within a regular expression `(?<flags><regex>)`
> Apart from using the 3rd argument, we can set flags within the regex itself.

> The flags are represented with letters as below:

| Letter | Flag | Full Name |
|:------:|:----:|:---------:|
| a | `re.A` | `re.ASCII` |
| i | `re.I` | `re.IGNORECASE` |
| L | `re.L` | `re.LOCALE` |
| m | `re.M` | `re.MULTILINE` |
| s | `re.S` | `re.DOTALL` |
| u | `re.U` | `re.UNICODE` |
| x | `re.X` | `re.VERBOSE` |


In [157]:
# DOTALL flag within a regex
# the flag must be at the beginning of the regex
re.search('(?s)foo.bar.baz', 'foo\nbar\nbaz')

<re.Match object; span=(0, 11), match='foo\nbar\nbaz'>
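
Several flag letters can be written together in one inline group. The sketch below checks that `(?im)` behaves like passing `re.I | re.M` as the 3rd argument; the variable names are illustrative.

```python
import re

s = 'FOO\nBAR\nBAZ'

# Inline form: i (IGNORECASE) and m (MULTILINE) set at the start of the regex.
inline = re.search('(?im)^bar', s)

# Equivalent form using the flags argument.
argument = re.search('^bar', s, re.I | re.M)

print(inline)
print(inline.span() == argument.span())
```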

### 3. Setting flags within a regular expression in the duration of a group
> A complex pattern may need a flag that applies to only one group: `(?<flags>:<regex>)`.

In [160]:
# ignore case for a group
print(re.search('(?i:foo)bar', 'FOObar')) # case is ignored for 'foo'

<re.Match object; span=(0, 6), match='FOObar'>


In [None]:
# ignore case for a group
print(re.search('(?i:foo)bar', 'FOOBAR')) # case is ignored for 'foo' but not for 'bar', so there is no match

### 4. Removing flags within a regular expression in the duration of a group 
`(?-<flags>:<regex>)`

In [161]:
# Applying a common flag to the regex with the 3rd argument but remove it for a group within the regex
# In this case, re.IGNORECASE is applied on the entire regex but we use ?-i: for it not to apply to foo

# foo must be lowercase but case is ignored for bar
# (?-i:foo) turns off the re.IGNORECASE on this specific group duration
print(re.search('(?-i:foo)bar', 'FOOBAR', re.IGNORECASE)) 

None


In [162]:
# foo must be lowercase but case is ignored for bar
# (?-i:foo) turns off the re.IGNORECASE on this specific group duration
print(re.search('(?-i:foo)bar', 'fooBAR', re.IGNORECASE))

<re.Match object; span=(0, 6), match='fooBAR'>
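
One place this can help is a fixed, case-sensitive prefix followed by a case-insensitive body. The sketch below uses a made-up identifier format (`ID-` followed by hex digits) purely for illustration.

```python
import re

# 'ID' must be uppercase; the hex digits after the dash may be in either case.
regex = r'(?-i:ID)-[a-f0-9]+'

upper_prefix = re.search(regex, 'ID-3FA2', re.IGNORECASE)
lower_prefix = re.search(regex, 'id-3fa2', re.IGNORECASE)

print(upper_prefix)
print(lower_prefix)
```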
