In [1]:
# Matching using the 'in' operator
s = "foo123bar"
"123" in s # True if present

True

In [2]:
# Matching using string functions
s = "foo123bar"
s.find("123") # returns the start index of the substring, or -1 if it is absent

3

In [3]:
# Matching using string functions
s = "foo123bar"
s.index("123") # returns the start index of the substring, raises ValueError if it is absent

3
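
The two methods behave the same when the substring is present and differ only when it is missing. A minimal sketch of that difference; the search string "999" and the variable names are illustrative and not part of the original examples.

```python
s = "foo123bar"

# find() signals "not found" by returning -1
missing_pos = s.find("999")

# index() raises ValueError instead of returning a sentinel value
try:
    s.index("999")
    index_raised = False
except ValueError:
    index_raised = True

print(missing_pos, index_raised)
```

Which one to use depends on whether a missing substring is an expected case (check for -1) or an error (let the exception propagate).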

The cases above are simple, and the 'in' operator and string methods handle them well. For more complex patterns, regular expressions (regex) come in handy.

# The re Module

The `re` module is part of the Python standard library and provides regex functionality.

## `re.search(<regex>, <string>)`

The search function scans a string for the first location where the regex matches. If a match is found it returns a match object; otherwise it returns None.

### Example: A simple case that also works with the 'in' operator and string functions


In [4]:
# character by character match
import re
s = 'foo123bar'
re.search("123", s) # returns a match object if the substring is found, otherwise None

<re.Match object; span=(3, 6), match='123'>

In [5]:
# match object is truthy, so it can be used in a Boolean context
if re.search("123", s):
    print("Found a match")
else:
    print("No match")

Found a match
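
Beyond its truth value, the match object carries the details shown in its printed form. A short sketch of reading them back with `group()`, `span()`, `start()` and `end()`; the variable names are illustrative.

```python
import re

m = re.search("123", "foo123bar")

matched_text = m.group()  # the substring that matched
start, end = m.span()     # start index and (exclusive) end index of the match

print(matched_text, start, end)
print(m.start(), m.end())  # the same two indexes, one at a time
```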


## Regex Metacharacters
We can go a step further with regex: instead of matching character by character, we can use metacharacters to create flexible, generic patterns to match against the input strings.

All regex metacharacters:
1. Dot (.) - matches any single character except a newline.
2. Caret (^) - anchors a match at the beginning of a string; as the first character inside square brackets it complements a character class.
3. Dollar ($) - anchors a match at the end of a string.
4. Asterisk (*) - matches zero or more repetitions.
5. Plus (+) - matches one or more repetitions.
6. Question mark (?) - matches zero or one repetition.
7. Curly braces ({}) - match an explicitly specified number of repetitions, given between the braces.
8. Backslash (\) - escapes a metacharacter so that it loses its special meaning.
9. Square brackets ([]) - define a character class.
10. Pipe (|) - designates alternation, matching either of the alternatives it separates.
11. Parentheses (()) - create a group.
12. :, #, =, ! - designate a specialized group.
13. Angle brackets (<>) - create a named group.
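
Before looking at individual metacharacters in detail, here is a rough sketch of a few entries from the list in action. The sample strings ("color", "hotdogs" and so on) and variable names are illustrative only.

```python
import re

# ^ and $ anchor the pattern to the start and end of the string
starts_with_foo = re.search(r"^foo", "foo123bar") is not None
ends_with_foo = re.search(r"foo$", "foo123bar") is not None

# {3} repeats the preceding item exactly three times
three_digits = re.search(r"[0-9]{3}", "foo123bar").group()

# + means one or more repetitions, ? means zero or one
digit_run = re.search(r"[0-9]+", "foo12345bar").group()
optional_u = re.search(r"colou?r", "color").group()

# | separates alternatives, () groups them
alternation = re.search(r"(cat|dog)s", "hotdogs").group()

print(starts_with_foo, ends_with_foo)
print(three_digits, digit_run, optional_u, alternation)
```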

## Regex metacharacters that match a single character

### 1.  Square brackets ([])
-  Square brackets [] define a character class. A character class matches any single character that belongs to the class.

In [6]:
s = 'foo123bar'
re.search('[0-9][0-9][0-9]', s) # generic pattern to match any three consecutive decimal digits

<re.Match object; span=(3, 6), match='123'>

In [7]:
# works for any other string to find any three consecutive decimal digits
s = 'foo456bar'
re.search('[0-9][0-9][0-9]', s) # [0-9] is a range that matches a single character between 0 and 9

<re.Match object; span=(3, 6), match='456'>

In [8]:
# None is returned if the string doesn't contain the pattern
s = '12foo46bar'
print(re.search('[0-9][0-9][0-9]', s)) 

None


In [9]:
# alphabet range
s = 'Goodluck'
re.search('Good[a-z][a-z][a-z][a-z]', s) 

<re.Match object; span=(0, 8), match='Goodluck'>

In [10]:
# specific character match
# The pattern finds the first character that is 'k', 'c' or 'q', anywhere in the string
s = "kelvin"
re.search("[kcq]",s) 

<re.Match object; span=(0, 1), match='k'>

In [11]:
# matches only the first occurrence of any character in the character class (the leftmost possible match)
s = "celvink"
re.search("[kcq]",s) 

<re.Match object; span=(0, 1), match='c'>

In [12]:
# multiple ranges in one character class
s = "---a0---"
re.search("[0-9a-fA-F]",s) # matches any hexadecimal digit character in the string

<re.Match object; span=(3, 4), match='a'>

As these examples show, this kind of matching would be cumbersome with the 'in' operator or string methods alone.

A character class can also be combined with other metacharacters.

### Complementing with caret ^

In [13]:
# Complementing a character class with caret ^
# (only the result of the last expression in the cell is displayed)
re.search('[^0-9]', '12345foo') # matches any character that is not a decimal digit
# if the caret appears anywhere in the character class except as the first character,
# it has no special meaning and is taken literally
re.search('[#:^]', 'foo^bar:baz#qux') # the caret is found as the 4th character in the string

<re.Match object; span=(3, 4), match='^'>
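
Because a notebook cell only displays its last expression, the result of the complemented class is not visible above. A small sketch that keeps both results; the variable names are illustrative.

```python
import re

# A leading caret complements the class: first character that is NOT a digit
first_non_digit = re.search(r"[^0-9]", "12345foo")

# A caret that is not first in the class is just a literal caret
literal_caret = re.search(r"[#:^]", "foo^bar:baz#qux")

print(first_non_digit.group(), first_non_digit.span())
print(literal_caret.group(), literal_caret.span())
```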

### Escaping metacharacters

Sometimes we want a metacharacter to match as a literal character rather than carry its special meaning.

In [14]:
# Escaping character
# a hyphen defines a range in a character class, but sometimes a literal hyphen must be matched
# in that case, place it first or last in the class, or escape it

# define hyphen as first character
re.search('[-abc]', '123-456') # match found at index 3

<re.Match object; span=(3, 4), match='-'>

In [15]:
# define hyphen as last character
re.search('[abc-]', '123-456') # match found at index 3

<re.Match object; span=(3, 4), match='-'>

In [16]:
# escape hyphen using '\' in the pattern
re.search(r'[ab\-c]', '123-456') # match found at index 3

<re.Match object; span=(3, 4), match='-'>

In [21]:
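# an opening square bracket inside a character class is matched literally (here it is placed first)
# note: newer Python versions may emit a FutureWarning for a class that starts with '['; r'[\[o]' avoids it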
re.search('[[o]', 'f[oo1]') 

<re.Match object; span=(1, 2), match='['>

In [24]:
# escape a closing square bracket with a backslash
re.search(r'[o\]]', 'f]oo1]') 

<re.Match object; span=(1, 2), match=']'>
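
Two related cases, sketched under the same rules: a closing square bracket placed first in the class is also taken literally, and the backslash itself has to be escaped to be matched. The sample strings and variable names are illustrative.

```python
import re

# ']' placed first in the class is a literal, so the class contains ']' and 'o'
bracket_first = re.search(r"[]o]", "f]oo1]")

# a backslash must be escaped inside the class to match a literal backslash
backslash = re.search(r"[\\]", r"foo\bar")

print(bracket_first.group(), bracket_first.span())
print(backslash.group(), backslash.span())
```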

### Metacharacters lose meaning inside a character class

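Most metacharacters placed inside a character class lose their special meaning and match themselves as literal characters. The exceptions are the ones covered above: a leading caret, a hyphen between two characters, the closing square bracket and the backslash.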

In [27]:
re.search('[)*+|]', '123*456') # '*' is found in the string as literal at index 3

<re.Match object; span=(3, 4), match='*'>

In [28]:
re.search('[)*+|]', '123+456') # '+' is found in the string as literal at index 3

<re.Match object; span=(3, 4), match='+'>
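
To make the contrast concrete, the sketch below runs the same '+' once as a quantifier and once inside a character class. The sample strings "caaat" and "1+1" are illustrative.

```python
import re

# Outside a class, '+' is a quantifier: one or more 'a' characters
as_quantifier = re.search(r"a+", "caaat").group()

# Inside a class, '+' is a literal: the class matches a single 'a' or '+'
as_literal_a = re.search(r"[a+]", "caaat").group()
as_literal_plus = re.search(r"[a+]", "1+1").group()

print(as_quantifier, as_literal_a, as_literal_plus)
```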

### 2.  dot (.)
-  The dot matches any single character except a newline.

In [18]:
s = 'foo123bar'
re.search('1.3', s) # matches because '1' and '3' are separated by exactly one character that is not a newline

<re.Match object; span=(3, 6), match='123'>

In [19]:
s = 'foo13bar'
print(re.search('1.3', s)) # returns None because there is no character between 1 & 3 in the string

None
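
The newline exception can be checked directly. The sketch below also uses the `re.DOTALL` flag, which is not covered in the text above; it is the standard `re` flag that lets the dot match a newline as well. The variable names are illustrative.

```python
import re

s = "1\n3"  # '1' and '3' separated by a newline

# By default the dot does not match a newline, so there is no match
default_result = re.search(r"1.3", s)

# With re.DOTALL the dot matches any character, including a newline
dotall_result = re.search(r"1.3", s, flags=re.DOTALL)

print(default_result)
print(dotall_result.span())
```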
