In [1]:
# Matching using the 'in' operator
s = "foo123bar"
"123" in s # True if substring is present

True

In [2]:
# Matching using string functions
s = "foo123bar"
s.find("123") # returns the start index of the substring if present, otherwise -1

3

These simple cases work, but the 'in' operator and string methods do not scale to more complex patterns, where we may end up writing verbose, unnecessary code.
Regex comes in handy for such cases, so there is no need to reinvent the wheel.
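
To make the point concrete, here is a rough sketch of what a hand-written check for "three consecutive digits" might look like without regex. The helper name `has_three_digits` is hypothetical and not part of the original notebook.

```python
def has_three_digits(text):
    """Return True if text contains three consecutive ASCII digits."""
    run = 0  # length of the current run of digits
    for ch in text:
        run = run + 1 if ch in "0123456789" else 0
        if run == 3:
            return True
    return False


print(has_three_digits("foo123bar"))   # True
print(has_three_digits("12foo46bar"))  # False
```

Every new kind of pattern would need another loop like this one, which is the verbosity a regex pattern avoids.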

# The re Module

The re module is part of the Python standard library and provides regex functionality.

## `re.search(<regex>, <string>)`

The search function scans a string for the first location where the regex matches. If a match is found it returns a match object; otherwise it returns None.
It is the simplest of all the functions in the re module.

### Example: Simple case that also works with the 'in' operator and string functions


In [3]:
# character by character match
import re
s = 'foo123bar'
re.search("123", s) # returns a match object if substring is found otherwise None

<re.Match object; span=(3, 6), match='123'>

In [4]:
# match object is truthy, so it can be used in a Boolean context like with conditionals
if re.search("123", s):
    print("Found a match.")
else:
    print("No match found.")

Found a match.


In [5]:
# None is falsy, so the else branch runs when no match is found
if re.search("1224", s):
    print("Found a match")
else:
    print("No match found")

No match found
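
Beyond truthiness, the match object also carries the details shown in its printed form. The following is a minimal sketch using the standard `group()` and `span()` methods of a match object; the variable names are only illustrative.

```python
import re

s = "foo123bar"
match = re.search("123", s)

if match is not None:
    matched_text = match.group()  # the substring that matched
    start, end = match.span()     # (start, end) indices; end is exclusive
    print(matched_text, start, end)      # 123 3 6
    print(s[start:end] == matched_text)  # True
```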


## Regex Metacharacters
We can go a step further with regex: instead of matching character by character, we can use metacharacters to create flexible and 
generic patterns to be matched in the input strings. Applications include input validation, where we can define generic patterns for email addresses, phone numbers and so on that user input must follow.

All regex metacharacters:
1. Dot (.) - Matches any single character except a newline.
2. Caret (^) - Anchors a match at the beginning of a string; as the first character inside square brackets, it complements a character class.
3. Dollar ($) - Anchors a match at the end of a string.
4. Asterisk (*) - Matches zero or more repetitions.
5. Plus (+) - Matches one or more repetitions.
6. Question mark (?) - Matches zero or one repetition; placed after *, + or ?, it makes that quantifier non-greedy.
7. Curly braces ({}) - Matches an explicitly specified number of repetitions given between the braces.
8. Backslash (\) - Escapes a metacharacter so that it is matched literally, and introduces special sequences such as \w and \d.
9. Square brackets ([]) - Defines a character class.
10. Pipe (|) - Separates alternatives (alternation).
11. Parentheses (()) - Creates a group.
12. :, #, =, ! - Used after (? to create specialized groups.
13. <> - Used in (?P<name>...) to create a named group.
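
As a quick preview of a few entries from the list, the sketch below tries alternation, curly braces and a named group. The sample strings and the group names `year` and `month` are made up for illustration.

```python
import re

# Alternation: match either 'cat' or 'dog'
alt = re.search(r"cat|dog", "hotdog")

# Curly braces: exactly three digits in a row
braces = re.search(r"[0-9]{3}", "foo123bar")

# Parentheses create a group; (?P<name>...) creates a named group
named = re.search(r"(?P<year>[0-9]{4})-(?P<month>[0-9]{2})", "date: 2024-05")

print(alt.group())           # dog
print(braces.span())         # (3, 6)
print(named.group("year"))   # 2024
print(named.group("month"))  # 05
```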

## Regex metacharacters that match a single character

### 1.  Square brackets ([])
-  Square brackets [] define a character class. A character class matches any single character that belongs to the class.

In [6]:
s = 'foo123bar'
re.search('[0-9][0-9][0-9]', s) # generic pattern to match any three consecutive decimal digits

<re.Match object; span=(3, 6), match='123'>

In [7]:
# works for any other string to find any three consecutive decimal digits
s = 'foo456bar'
re.search('[0-9][0-9][0-9]', s) # [0-9] is a range that matches a single character in the 0 to 9 range.

<re.Match object; span=(3, 6), match='456'>

In [8]:
# None is returned if the string doesn't contain the pattern
s = '12foo46bar'
print(re.search('[0-9][0-9][0-9]', s)) 

None


In [9]:
# alphabet range
s = 'Goodluck'
re.search('Good[a-z][a-z][a-z][a-z]', s) 

<re.Match object; span=(0, 8), match='Goodluck'>

In [10]:
# specific character match
# The pattern finds whether the name has a single character which is 'k', 'c' or 'q' anywhere in the string
s = "kelvin"
re.search("[kcq]",s) 

<re.Match object; span=(0, 1), match='k'>

In [11]:
# matches only the first occurrence of any of the characters in the character class (leftmost possible match).
s = "celvink"
re.search("[kcq]",s) 

<re.Match object; span=(0, 1), match='c'>

In [12]:
# multiple-range character class.
s = "---a0---"
re.search("[0-9a-fA-F]",s) # matches any hexadecimal digit character in the string

<re.Match object; span=(3, 4), match='a'>

As observed, this would be difficult when using the 'in' operator or string methods.
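
For comparison, here is a rough sketch of the same "first hexadecimal digit" search written by hand next to the regex version. The helper `first_hex_index` and the constant `HEX_DIGITS` are hypothetical names introduced only for this illustration.

```python
import re

HEX_DIGITS = "0123456789abcdefABCDEF"


def first_hex_index(text):
    """Return the index of the first hexadecimal digit in text, or -1."""
    for i, ch in enumerate(text):
        if ch in HEX_DIGITS:
            return i
    return -1


s = "---a0---"
match = re.search("[0-9a-fA-F]", s)

print(first_hex_index(s))  # 3
print(match.start())       # 3
```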

We can go further to complement a character class with other metacharacters

### Complementing with caret ^

In [13]:
# Complementing a character class with caret ^
re.search('[^0-9]', '12345foo') # matches any character that is not a decimal digit

<re.Match object; span=(5, 6), match='f'>

In [14]:
# if the caret appears anywhere in the character class other than as the first character, it has no special meaning and is taken literally.
re.search('[#:^]', 'foo^bar:baz#qux') # the caret is found as the 4th character in the string.

<re.Match object; span=(3, 4), match='^'>

### Metacharacters lose meaning inside a character class

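Most metacharacters placed inside a character class lose their special meaning and match only themselves as literal characters. The exceptions are a leading ^, a hyphen between two characters, the closing ] and the backslash.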

In [15]:
re.search('[)*+|]', '123*456') # '*' is found in the string as literal at index 3

<re.Match object; span=(3, 4), match='*'>

In [16]:
re.search('[)*+|]', '123+456') # '+' is found in the string as literal at index 3

<re.Match object; span=(3, 4), match='+'>

In [17]:
re.search('[)*^|]', '123+^456') # '^' is found in the string as literal at index 4

<re.Match object; span=(4, 5), match='^'>

### 2.  dot (.) (aka  wildcard)
-  The dot matches any single character except a newline.

In [18]:
s = 'foo123bar'
re.search('1.3', s) # matches because there is a single non-newline character between 1 and 3

<re.Match object; span=(3, 6), match='123'>

In [19]:
s = 'foo13bar'
print(re.search('1.3', s)) # returns None because there is no character between 1 & 3 in the string

None


In [20]:
s = 'foo1\n3bar'
print(re.search('1.3', s)) # returns None because the character between 1 & 3 is a newline
# we can force the dot to match a newline with the re.DOTALL flag if we need to

None
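
One way to make the dot match a newline is the standard `re.DOTALL` flag, passed as the third argument to `re.search`. A minimal sketch with the same input string:

```python
import re

s = "foo1\n3bar"
default_match = re.search("1.3", s)            # '.' does not match '\n' by default
dotall_match = re.search("1.3", s, re.DOTALL)  # DOTALL lets '.' match '\n' too

print(default_match)        # None
print(dotall_match.span())  # (3, 6)
```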


### 3. \w  (lowercase 'w')
   Matches any alphanumeric word character. 
   
   These include uppercase and lowercase letters, digits and the underscore.
   
   For ASCII text, the equivalent character class is [a-zA-Z0-9_].
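
One caveat worth a quick check: for Python 3 string patterns, `\w` is Unicode-aware by default, so it can match letters outside `[a-zA-Z0-9_]`. The sketch below uses the standard `re.ASCII` flag to restrict it; the sample character is an arbitrary choice.

```python
import re

s = "\u00e9"  # the single character 'é'

unicode_match = re.search(r"\w", s)          # Unicode-aware by default
ascii_match = re.search(r"\w", s, re.ASCII)  # restricts \w to [a-zA-Z0-9_]
class_match = re.search("[a-zA-Z0-9_]", s)   # explicit ASCII character class

print(unicode_match is not None)  # True
print(ascii_match)                # None
print(class_match)                # None
```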

In [21]:
# using \w
s = "#(.a$@&"  # named s to avoid shadowing the built-in str
re.search(r'\w', s)

<re.Match object; span=(3, 4), match='a'>

In [22]:
# same result if we use a character class
s = "#(.a$@&"
re.search('[a-zA-Z0-9_]', s)

<re.Match object; span=(3, 4), match='a'>

### 4. \W (uppercase 'W')

    Matches any non-word character; equivalent to the [^a-zA-Z0-9_] character class for ASCII text


In [23]:
# using \W
re.search(r'\W', 'a_1*3Qb') # find '*' at index 3 because it is non-word character

<re.Match object; span=(3, 4), match='*'>

In [24]:
# using [^a-zA-Z0-9_] character class
re.search('[^a-zA-Z0-9_]', 'a_1*3Qb') # finds '*' at index 3 because it is a non-word character

<re.Match object; span=(3, 4), match='*'>

### 5. \d
    Matches a decimal digit character; equivalent to the [0-9] character class for ASCII text


In [25]:
re.search(r"\d", "Kelvin6") # Finds '6' at index 6

<re.Match object; span=(6, 7), match='6'>

### 6. \D

    Matches any character that is not a decimal digit; equivalent to the [^0-9] character class for ASCII text

In [26]:
re.search(r"\D", "Kelvin6")  # Finds 'K' at index 0, the leftmost non-decimal digit character

<re.Match object; span=(0, 1), match='K'>

### 7. \s
   Matches a single whitespace character.

In [27]:
re.search(r'\s', 'foo\nbar baz') # unlike the dot (.), \s matches a newline because it is considered whitespace

<re.Match object; span=(3, 4), match='\n'>

### 8. \S

    Matches any character except whitespace.

In [28]:
re.search(r'\S', '  \n foo  \n  ') # matches f because it is the leftmost non-whitespace character

<re.Match object; span=(4, 5), match='f'>

### 9. Using \w, \W, \d, \D, \s  and \S in a character class

    These sequences can also be used inside a character class, where they keep the meaning discussed above.

In [29]:
re.search(r'[\d\w\s]', '---3---') # matches single digit or word or whitespace character

<re.Match object; span=(3, 4), match='3'>

In [30]:
re.search(r'[\d\w\s]', '---a---') # matches single digit or word or whitespace character

<re.Match object; span=(3, 4), match='a'>

In [31]:
re.search(r'[\d\w\s]', '------') # returns None: the string has no digit, word or whitespace character

In [32]:
# can be shortened to [\w\s] because \w already includes \d
re.search(r'[\w\s]', '---3---') # matches single digit or word or whitespace character

<re.Match object; span=(3, 4), match='3'>
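
These sequences can also be combined with the complementing caret. As a small sketch, the class `[^\w\s]` matches anything that is neither a word character nor whitespace, which is a common way to find punctuation; the sample string is made up.

```python
import re

s = "hello, world"
match = re.search(r"[^\w\s]", s)  # neither a word character nor whitespace

print(match.group())  # ,
print(match.span())   # (5, 6)
```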

### Escaping metacharacters

Sometimes we want metacharacters to match as literal characters rather than have their special meaning.

In [33]:
# Escaping character

# 1. Escape hyphen
# inside a character class, place the hyphen first or last, or escape it
# define hyphen as first character
re.search('[-abc]', '123-456') # match found at index 3

<re.Match object; span=(3, 4), match='-'>

In [34]:
# define hyphen as last character
re.search('[abc-]', '123-456') # match found at index 3

<re.Match object; span=(3, 4), match='-'>

In [35]:
# escape hyphen using '\' in the pattern
re.search(r'[ab\-c]', '123-456') # match found at index 3

<re.Match object; span=(3, 4), match='-'>

In [36]:
# 2. Escape square brackets
# escape the opening square bracket with a backslash
re.search(r'[\[o]', 'f[oo1]') 

<re.Match object; span=(1, 2), match='['>

In [37]:
# escape the closing square bracket with a backslash
re.search(r'[o\]]', 'f]oo1]') 

<re.Match object; span=(1, 2), match=']'>

In [38]:
# 3. Escaping the wildcard (.)

# here the dot matches any character that is not a newline
re.search('.', 'foo.bar') # not escaped here, so it finds 'f', the leftmost character that is not a newline

<re.Match object; span=(0, 1), match='f'>

In [39]:
# Escaping the wildcard (.) with backslash
re.search(r'\.', 'foo.bar') # here the wildcard is escaped, so it matches the literal '.' in the input string

<re.Match object; span=(3, 4), match='.'>

In [40]:
# 4. Escaping backslash 
# pass the regex as a raw Python string, then escape the backslash with another one so that the regex parser matches a single backslash in the input string
s = r"Kelvin\Macharia"
print(s)
re.search(r"\\", s)

Kelvin\Macharia


<re.Match object; span=(6, 7), match='\\'>

In [41]:
# match two consecutive backslashes in an input string
s = r"Kelvin\\Macharia"
re.search(r"\\\\", s)

<re.Match object; span=(6, 8), match='\\\\'>
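
When the text to be matched literally is not known in advance, escaping by hand becomes error-prone. The standard `re.escape` function does the escaping for us; the sketch below assumes we want to find the literal text `1.3`.

```python
import re

literal = "1.3"
pattern = re.escape(literal)  # escapes '.' so it matches a literal dot

no_match = re.search(pattern, "foo123bar")  # '2' is not a literal dot
match = re.search(pattern, "foo1.3bar")

print(no_match)      # None
print(match.span())  # (3, 6)
```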

## Anchors

    Anchors specify the location in the search string where the match must be found, i.e. they constrain the position of the match.

### 1. Caret (^) anchor: anchors the match at the beginning of the string.

In [42]:
re.search('^foo', 'foobar') # substring 'foo' is at the beginning of the string so it is found

<re.Match object; span=(0, 3), match='foo'>

In [43]:
print(re.search('^foo', 'barfoo')) # substring 'foo' is not at the start of the string so no match.

None


In [44]:
# \A also anchors at the start of the string; unlike ^, its meaning does not change in MULTILINE mode
re.search(r'\Afoo', 'foobar') # substring 'foo' is at the beginning of the string so it is found

<re.Match object; span=(0, 3), match='foo'>

In [45]:
print(re.search(r'\Afoo', 'barfoo')) # substring 'foo' is not at the start of the string so no match.

None
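
The difference between `^` and `\A` only shows up with the standard `re.MULTILINE` flag. A minimal sketch, using a made-up two-line string:

```python
import re

s = "foo\nbar"
plain = re.search("^bar", s)                    # '^' only matches at the start of the string
multiline = re.search("^bar", s, re.MULTILINE)  # '^' also matches right after a newline
strict = re.search(r"\Abar", s, re.MULTILINE)   # \A still matches only at the very start

print(plain)             # None
print(multiline.span())  # (4, 7)
print(strict)            # None
```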


### 2. Dollar ($)
    Anchors a match to the end of the string.

In [46]:
re.search('bar$', 'foobar')

<re.Match object; span=(3, 6), match='bar'>

In [47]:
re.search(r'bar\Z', 'foobar') # \Z works too for this purpose

<re.Match object; span=(3, 6), match='bar'>

In [48]:
re.search('bar$', 'foobar\n') # '$' also matches just before a single trailing newline

<re.Match object; span=(3, 6), match='bar'>

In [49]:
print(re.search(r'bar\Z', 'foobar\n')) # \Z fails because technically bar is not at the end of the string, there is a newline character

None
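
This distinction matters for input validation. The sketch below, with a hypothetical `user_input` value, compares `^...$`, `\A...\Z` and the standard `re.fullmatch` function on a string that ends with a stray newline.

```python
import re

user_input = "123\n"  # three digits followed by a stray newline
pattern = "[0-9][0-9][0-9]"

loose = re.search("^" + pattern + "$", user_input)       # '$' tolerates the trailing newline
strict = re.search(r"\A" + pattern + r"\Z", user_input)  # \Z requires the true end of the string
full = re.fullmatch(pattern, user_input)                 # must match the entire string

print(loose is not None)  # True
print(strict)             # None
print(full)               # None
```

If a trailing newline should be rejected, `\Z` or `re.fullmatch` is the safer choice of the three.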


### 3. \b
    Anchors a match to a word boundary.

    Note: word characters are [a-zA-Z0-9_]

In [50]:
re.search(r"\bbar", "foo bar") # bar is at the beginning of a new word

<re.Match object; span=(4, 7), match='bar'>

In [51]:
re.search(r"\bbar", "foo.bar") # bar is at the beginning of a new word because '.' is a non-word character

<re.Match object; span=(4, 7), match='bar'>

In [52]:
print(re.search(r"\bbar", "foobar")) # bar is not a separate word in this case

None


In [53]:
# \b 
re.search(r"foo\b", "foo bar")  # foo ends then a new word begins, there is a clear boundary

<re.Match object; span=(0, 3), match='foo'>

In [54]:
re.search(r"foo\b", "foo.bar") # foo ends then a new word begins, there is a clear boundary

<re.Match object; span=(0, 3), match='foo'>

In [55]:
print(re.search(r"foo\b", "foobar")) # no boundary: foo is followed by the word character 'b'

None


In [56]:
# Word boundaries exist before and after "bar", so the match is valid because space is a non-word character
re.search(r'\bbar\b', 'foo bar baz')

<re.Match object; span=(4, 7), match='bar'>

In [57]:
# There is a word boundary after "bar" but not before it, because '_' is a word character, so there is no match
print(re.search(r'\bbar\b', '_bar foo baz'))

None


In [58]:
# Word boundaries exist before and after "bar", so the match is valid because '-'  is a non-word character
re.search(r'\bbar\b', '-bar foo baz') 

<re.Match object; span=(1, 4), match='bar'>
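
A typical use of `\b` on both sides is a whole-word search. In this sketch (the sample string is made up), a plain search finds `cat` inside a longer word, while the anchored pattern finds only the standalone word.

```python
import re

s = "concatenate cat"
anywhere = re.search("cat", s)          # first occurrence, inside 'concatenate'
whole_word = re.search(r"\bcat\b", s)   # only the standalone word

print(anywhere.span())    # (3, 6)
print(whole_word.span())  # (12, 15)
```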

### 4. \B
    Anchors to a location that isn't a word boundary. Opposite of \b


In [59]:
# No match because in 'foo' there is a word boundary both before and after foo
print(re.search(r'\Bfoo\B' , 'foo')) 

None


In [60]:
# There is a match because foo is surrounded by word characters, so there is no word boundary on either side
print(re.search(r'\Bfoo\B' , 'barfoobaz')) 

<re.Match object; span=(3, 6), match='foo'>
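
To see `\b` and `\B` side by side, the sketch below runs both on one made-up string: `\b` picks the standalone word, while `\B` picks the occurrence embedded inside a longer word.

```python
import re

s = "cat concatenate"
standalone = re.search(r"\bcat\b", s)  # boundaries on both sides
embedded = re.search(r"\Bcat\B", s)    # no boundary on either side

print(standalone.span())  # (0, 3)
print(embedded.span())    # (7, 10)
```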


## Quantifiers

    Quantifiers appear after a portion of <regex> to indicate repetition of that particular portion.