## Introduction to Regular Expressions 

Regular expressions (called REs, regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the `re` module.

- A regular expression is a sequence of characters that defines a search pattern. It helps in finding and replacing text within strings. Python provides the `re` module (short for regular expression) for this purpose.

- Using this little language, you specify the rules for the set of possible strings that you want to match; this set might contain English sentences, or e-mail addresses, or TeX commands, or anything you like. You can then ask questions such as “Does this string match the pattern?”, or “Is there a match for the pattern anywhere in this string?”. You can also use REs to modify a string or to split it apart in various ways. 

- Regular expression patterns are compiled into a series of bytecodes which are then executed by a matching engine written in C. For advanced use, it may be necessary to pay careful attention to how the engine will execute a given RE, and write the RE in a certain way in order to produce bytecode that runs faster. 

- The regular expression language is relatively small and restricted, so not all possible string processing tasks can be done using regular expressions. There are also tasks that can be done with regular expressions, but the expressions turn out to be very complicated. In these cases, you may be better off writing Python code to do the processing; while Python code will be slower than an elaborate regular expression, it will also probably be more understandable.

- Regular expressions have been used in financial crime analytics, for example to match multiple similar but slightly different names, as well as in other areas.
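
As a rough illustration of the name-matching use mentioned above, a single pattern can cover several spellings of one name. The names and the pattern here are invented for this sketch; real screening work usually needs far more than one regular expression.

```python
import re

# Hypothetical name variants, invented for illustration only.
names = ["John Smith", "Jon Smyth", "JONATHAN SMITH", "Joan Smithers", "John Smithson"]

# joh?n      -> "jon" or "john"
# (athan)?   -> optionally followed by "athan" ("jonathan")
# sm[iy]th   -> "smith" or "smyth"
# ^ and $ require the whole string to be the name.
name_pat = re.compile(r'^joh?n(athan)?\s+sm[iy]th$', re.IGNORECASE)

similar = [n for n in names if name_pat.search(n)]
print(similar)  # ['John Smith', 'Jon Smyth', 'JONATHAN SMITH']
```

The pieces used here (repetition, character sets, anchors, flags, groups) are the topics of the sections that follow.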


### Writing RE Patterns

1. Repetitions

2. Character Sets & Character Ranges

3. Escape Codes

4. Anchoring 

5. Flags

6. Groups and Named Groups

### Writing RE Patterns 1. Repetitions

In [26]:
import re

### re.search()

In [27]:
text1 = "This is a beautiful day"

In [28]:
# Scan through string looking for a match to the pattern, returning a Match object, or None if no match was found.
re.search(r'is',text1)

<re.Match object; span=(2, 4), match='is'>

In [29]:
m = re.search(r'is',text1)
m

<re.Match object; span=(2, 4), match='is'>

In [30]:
# Return subgroup(s) of the match by indices or names. For 0 returns the entire match.
m.group()

'is'

In [31]:
m.start(), m.end(), m.span()

(2, 4, (2, 4))

### re.match(pattern, string, flags=0)

In [32]:
m = re.match(r'is',text1)
print(m)

None


In [33]:
m = re.match(r'Th',text1)
print(m)

<re.Match object; span=(0, 2), match='Th'>


In [34]:
m.group(), m.start(), m.end(), m.span()

('Th', 0, 2, (0, 2))
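
The cells above suggest that `re.match` only succeeds when the pattern fits at the very start of the string, while `re.search` looks anywhere. A small side-by-side sketch, which also brings in `re.fullmatch` (not used elsewhere in this notebook) for the case where the pattern must cover the entire string:

```python
import re

text1 = "This is a beautiful day"

at_start = re.match(r'is', text1)            # only tries position 0
anywhere = re.search(r'is', text1)           # scans forward through the string
whole_short = re.fullmatch(r'This', text1)   # pattern must cover the whole string
whole_long = re.fullmatch(r'This.*day', text1)

print(at_start)            # None
print(anywhere.span())     # (2, 4)
print(whole_short)         # None
print(whole_long.group())  # This is a beautiful day
```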

### re.findall()

In [35]:
# Return a list of all non-overlapping matches in the string
m = re.findall(r'is',text1)
m

['is', 'is']

In [36]:
text2 = "abbbaaabbbbabababa"

In [37]:
re.findall(r'ba', text2)

['ba', 'ba', 'ba', 'ba', 'ba']

In [38]:
# Return an iterator over all non-overlapping matches in the string
mat = re.finditer(r'ba', text2)
mat

<callable_iterator at 0x2453cb10940>

In [39]:
for m in mat:
    print(m.group(), m.start(), m.end(), m.span())

ba 3 5 (3, 5)
ba 10 12 (10, 12)
ba 12 14 (12, 14)
ba 14 16 (14, 16)
ba 16 18 (16, 18)


In [40]:
type(mat)

callable_iterator

In [41]:
# re.sub(pattern, repl, string, count=0, flags=0)
# Return the string obtained by replacing the leftmost non-overlapping occurrences of the pattern in string by the
# replacement repl.  repl can be either a string or a callable; if a string, backslash escapes in it are processed.  
# If it is a callable, it's passed the Match object and must return a replacement string to be used.
re.sub(r'ba', 'xy', text2)

'abbxyaabbbxyxyxyxy'

In [42]:
 re.sub(r'ba', 'xy', text2, count=2,flags=re.IGNORECASE)

'abbxyaabbbxybababa'
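
The comment on `re.sub` above mentions that `repl` can also be a callable. A minimal sketch of that form, where the helper name `shout` is made up for this example:

```python
import re

text2 = "abbbaaabbbbabababa"

def shout(match):
    # Called once per occurrence with the Match object;
    # the returned string replaces that occurrence.
    return match.group().upper()

result = re.sub(r'ba', shout, text2)
print(result)  # abbBAaabbbBABABABA
```

A callable is useful when the replacement depends on what was matched, which a fixed replacement string cannot express.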

In [43]:
# Compile a regular expression pattern, returning a Pattern object
pat = re.compile(r'ba')

In [44]:
print(pat)

re.compile('ba')


In [45]:
re.findall(pat,text2)

['ba', 'ba', 'ba', 'ba', 'ba']
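
A compiled Pattern object can be passed to the module-level functions, as above, but it also offers the same operations as methods. A short sketch of the method form:

```python
import re

text2 = "abbbaaabbbbabababa"
pat = re.compile(r'ba')

all_hits = pat.findall(text2)             # same result as re.findall(pat, text2)
first_span = pat.search(text2).span()     # location of the first occurrence
first_two = pat.sub('xy', text2, count=2) # replace only the first two occurrences

print(all_hits)    # ['ba', 'ba', 'ba', 'ba', 'ba']
print(first_span)  # (3, 5)
print(first_two)   # abbxyaabbbxybababa
```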

In [46]:
text3 = "akaks ksdkdkd; aksa&kks: ajsjss, shshs; ususu; hshs"

In [47]:
# Return a list of the words in the string, using sep as the delimiter string.
# None (the default value) means split according to any whitespace

text3.split()

['akaks', 'ksdkdkd;', 'aksa&kks:', 'ajsjss,', 'shshs;', 'ususu;', 'hshs']

In [48]:
text3.split(';')

['akaks ksdkdkd', ' aksa&kks: ajsjss, shshs', ' ususu', ' hshs']

In [49]:
re.split(r'[ ;:,]\s',text3)

['akaks ksdkdkd', 'aksa&kks', 'ajsjss', 'shshs', 'ususu', 'hshs']

In [50]:
re.split(r'[ ;:,]\s*',text3)

['akaks', 'ksdkdkd', 'aksa&kks', 'ajsjss', 'shshs', 'ususu', 'hshs']

### Writing RE - Repetition Coding Examples

In [55]:
text = "ab abb a a a abbbb abbbbbbb bbb"

#### 'ab*' - `a` followed by `zero or more b's`.

In [56]:
re.findall(r'ab*',text)

['ab', 'abb', 'a', 'a', 'a', 'abbbb', 'abbbbbbb']

#### 'ab+' - `a` followed by `one or more (>=1) b's.`

In [57]:
re.findall(r'ab+',text)

['ab', 'abb', 'abbbb', 'abbbbbbb']

#### 'ab?' - `a` followed by `zero or one b`.

In [58]:
re.findall(r'ab?', text)

['ab', 'ab', 'a', 'a', 'a', 'ab', 'ab']

In [59]:
re.findall(r'abb', text)

['abb', 'abb', 'abb']

In [62]:
re.findall(r'ab{4}', text) # a followed by exactly four b's

['abbbb', 'abbbb']

In [63]:
text

'ab abb a a a abbbb abbbbbbb bbb'

In [64]:
re.findall(r'ab{3,50}', text) # a followed by 3 to 50 b's

['abbbb', 'abbbbbbb']

In [65]:
re.findall(r'ab*?',text)

['a', 'a', 'a', 'a', 'a', 'a', 'a']

In [66]:
re.findall(r'ab+?',text)

['ab', 'ab', 'ab', 'ab']

In [67]:
re.findall(r'ab??',text)

['a', 'a', 'a', 'a', 'a', 'a', 'a']

In [68]:
re.findall(r'ab.',text)

['ab ', 'abb', 'abb', 'abb']
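
The `*?`, `+?` and `??` forms above are the non-greedy versions of `*`, `+` and `?`: they match as few characters as possible instead of as many as possible. The difference is easier to see when something has to follow the repetition. A small sketch with a made-up sample string:

```python
import re

markup = "<a><b>"   # made-up sample string

greedy = re.findall(r'<.+>', markup)   # + grabs as much as it can
lazy = re.findall(r'<.+?>', markup)    # +? grabs as little as it can

print(greedy)  # ['<a><b>']
print(lazy)    # ['<a>', '<b>']
```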

### Writing RE Patterns 2. Character Sets & Ranges

- A character set is a group of characters enclosed in square brackets `[ ]`, any one of which can match at that point in the pattern.

- As character sets grow larger, typing every character that should match becomes tedious. A more compact format uses character ranges, such as `[a-k]`.

#### Examples:

`a[xy]`: matches either `ax` or `ay`.

`a[^xy]`: matches `a` followed by any character other than `x` or `y` (a leading `^` inside `[ ]` negates the set).

`[a-k]`: matches any single character from `a` to `k`.

`[^a-k]`: matches any single character outside the range `a` to `k`.

In [69]:
text = "xyyxxyyyzzzxo"

In [70]:
re.findall(r'[xy]', text)

['x', 'y', 'y', 'x', 'x', 'y', 'y', 'y', 'x']

In [71]:
re.findall(r'[xyzo]', text)

['x', 'y', 'y', 'x', 'x', 'y', 'y', 'y', 'z', 'z', 'z', 'x', 'o']

In [72]:
re.findall(r'[xyz]', text)

['x', 'y', 'y', 'x', 'x', 'y', 'y', 'y', 'z', 'z', 'z', 'x']

In [73]:
re.findall(r'[abc]', text)

[]

In [74]:
re.findall(r'x[xy]', text)

['xy', 'xx']

In [75]:
re.findall(r'x[xy]+', text)

['xyyxxyyy']

In [76]:
re.findall(r'x[xy]+?', text)

['xy', 'xx']

In [77]:
text = "xxy xyxyx xaxb xxyy aaxz"

In [78]:
 re.findall(r'x[^xy]+', text)

['x ', 'xa', 'xb ', 'xz']

In [79]:
 re.findall(r'x[^xy]+?', text)


['x ', 'xa', 'xb', 'xz']

In [80]:
 text = "This a sample text. -- with some Punctuation marks!!!"

In [81]:
 re.findall(r'[A-Z][a-z]*', text)

['This', 'Punctuation']

In [82]:
 re.findall(r'[A-Z][a-z]+', text)

['This', 'Punctuation']

In [83]:
 re.findall(r'[^.\-! ]+', text)

['This', 'a', 'sample', 'text', 'with', 'some', 'Punctuation', 'marks']
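
Several ranges can sit inside one pair of brackets. As a small sketch, the following picks out hexadecimal color codes from a made-up sample string:

```python
import re

sample = "colors: #FF00aa, #zz12, #09af"   # made-up sample string

# [0-9a-fA-F] combines three ranges: digits, lowercase a-f and uppercase A-F.
hex_codes = re.findall(r'#[0-9a-fA-F]+', sample)
print(hex_codes)  # ['#FF00aa', '#09af']
```

`#zz12` is skipped because the character right after `#` is not in the set.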

### Writing RE Patterns 3. Escape Codes
    
We can use special escape codes to find specific types of patterns in data, such as digits, non-digits, whitespace etc.

`\d`: match a single digit

`\D`: match a single non-digit

`\w`: match a single word character (a letter, digit, or underscore)

`\W`: match a single non-word character

`\s`: match a single whitespace character (tab, space, newline, etc.)

`\S`: match a single non-whitespace character

In [84]:
text = "The cost of Python course is $125."

In [85]:
re.findall(r'\d', text)

['1', '2', '5']

In [86]:
re.findall(r'\d+', text)

['125']

In [87]:
re.findall(r'\D', text)

['T',
 'h',
 'e',
 ' ',
 'c',
 'o',
 's',
 't',
 ' ',
 'o',
 'f',
 ' ',
 'P',
 'y',
 't',
 'h',
 'o',
 'n',
 ' ',
 'c',
 'o',
 'u',
 'r',
 's',
 'e',
 ' ',
 'i',
 's',
 ' ',
 '$',
 '.']

In [88]:
re.findall(r'\D+', text)

['The cost of Python course is $', '.']

In [89]:
re.findall(r'\s', text)

[' ', ' ', ' ', ' ', ' ', ' ']

In [90]:
re.findall(r'\S', text)

['T',
 'h',
 'e',
 'c',
 'o',
 's',
 't',
 'o',
 'f',
 'P',
 'y',
 't',
 'h',
 'o',
 'n',
 'c',
 'o',
 'u',
 'r',
 's',
 'e',
 'i',
 's',
 '$',
 '1',
 '2',
 '5',
 '.']

In [91]:
re.findall(r'\S+', text)

['The', 'cost', 'of', 'Python', 'course', 'is', '$125.']

In [92]:
re.findall(r'\w+', text)

['The', 'cost', 'of', 'Python', 'course', 'is', '125']

In [93]:
re.findall(r'\W+', text)

[' ', ' ', ' ', ' ', ' ', ' $', '.']
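
Escape codes combine with literal characters and repetition like any other pattern element. A brief sketch that pulls the price out of the same sentence, and shows that `\w` treats the underscore as a word character (the second sample string is made up):

```python
import re

text = "The cost of Python course is $125."

# \$ is a literal dollar sign; \d+ is one or more digits.
price = re.findall(r'\$\d+', text)
print(price)  # ['$125']

# \w also matches the underscore, so snake_case stays in one piece.
words = re.findall(r'\w+', "snake_case var-1")
print(words)  # ['snake_case', 'var', '1']
```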

### Writing RE Patterns 4. Anchoring

In addition to describing the content of a pattern, you can specify where in the input text the pattern should appear by using anchors.

`^` -- Start of the string (or of each line when `re.MULTILINE` is set)

`$` -- End of the string (or of each line when `re.MULTILINE` is set)

`\A` -- Start of the string

`\Z` -- End of the string

`\b` -- Empty string at the beginning or end of a word (a word boundary)
    

In [94]:
text = "This is a beautiful day."

In [95]:
 re.findall(r'is', text)

['is', 'is']

In [96]:
re.findall(r'^is', text)

[]

In [97]:
re.findall(r'^T', text)

['T']

In [98]:
 re.findall(r'\.$', text)

['.']

In [99]:
 re.findall(r'\bis\b', text)


['is']

In [100]:
re.search(r'\bis\b', text)


<re.Match object; span=(5, 7), match='is'>

In [101]:
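slash1 = '\b'    # ordinary string: \b is the backspace control character
slash2 = '\\b'   # escaped backslash: the two characters \ and b
slash3 = r'\\b'  # raw string: kept as written, so the three characters \, \ and b
#print(slash1)   # would print an invisible backspace character
print(slash2)
print(slash3)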

\b
\\b
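
The cell above hints at why the patterns in this notebook are written as raw strings (`r'...'`). In an ordinary string literal Python itself interprets `\b` as a backspace character before the regex engine ever sees it. A short sketch of the effect on the word-boundary pattern:

```python
import re

text = "This is a beautiful day."

plain = '\bis\b'    # Python turns each \b into a backspace character
raw = r'\bis\b'     # raw string: backslash and b reach the regex engine intact

print(len(plain), len(raw))      # 4 6
print(re.findall(plain, text))   # []
print(re.findall(raw, text))     # ['is']
```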


### Writing RE Patterns 5. Flags
Sometimes we need to slightly tweak the behavior of a regular expression. The regular expression engine in Python offers a small number of flags that modify the behavior of the entire expression.

- `re.IGNORECASE` (or `re.I`, or the integer `2`): makes the regular expression case-insensitive

- `re.DOTALL` (or `re.S`, or `16`): makes the `.` character also match the newline character `\n`

- `re.MULTILINE` (or `re.M`, or `8`): makes `^` and `$`, which normally match only at the beginning or end of the string, also match at the beginning or end of each line within the string

- `re.VERBOSE` (or `re.X`, or `64`): makes complicated regular expressions more readable. This flag does two things:
   - 1) it causes all whitespace (other than in character classes) to be ignored, including line breaks;
   - 2) it treats the `#` character (again, unless it is inside a character class) as a comment character

- `re.DEBUG` (or `128`): prints debugging information while compiling a regular expression

- Multiple flags: sometimes we need several flags at the same time; combine them with the bitwise `OR` operator, for example `re.I | re.S | re.M`

In [102]:
text = "Python python PYTHON"

In [103]:
re.findall(r'Python', text)

['Python']

In [104]:
re.findall(r'Python', text, re.IGNORECASE)

['Python', 'python', 'PYTHON']

In [105]:
 re.findall(r'Python', text, re.I )

['Python', 'python', 'PYTHON']

In [106]:
re.findall(r'Python', text, 2)

['Python', 'python', 'PYTHON']

In [107]:
 re.I

re.IGNORECASE

In [108]:
re.S

re.DOTALL

In [109]:
text = "Py\nthon"
re.findall(r'.+', text)

['Py', 'thon']

In [110]:
re.findall(r'.+', text, re.DOTALL )

['Py\nthon']
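
`re.MULTILINE` is described in the flag list but not exercised in the cells here. A minimal sketch of its effect on `^`, using a made-up three-line string:

```python
import re

lines = "first line\nsecond line\nthird line"   # made-up sample string

default_hits = re.findall(r'^\w+', lines)                  # ^ matches only at the start
per_line_hits = re.findall(r'^\w+', lines, re.MULTILINE)   # ^ matches at each line start
# \A still means the very start of the string, even with re.MULTILINE.
string_start = re.findall(r'\A\w+', lines, re.MULTILINE)

print(default_hits)   # ['first']
print(per_line_hits)  # ['first', 'second', 'third']
print(string_start)   # ['first']
```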

In [111]:
 text = "Python is fun. Learning python."

In [112]:
re.sub(r'Py', 'My', text )

'Mython is fun. Learning python.'

In [113]:
re.sub(r'Py', 'My', text, flags=re.I )

'Mython is fun. Learning Mython.'

In [114]:
 re.sub(r'Py', 'My', text, count=1, flags=re.I )

'Mython is fun. Learning python.'

### Writing RE Patterns 6. Groups and Named Groups

Regular Expressions provide a mechanism to split the expression into groups. When using groups, we will be able to select each individual group within the match in addition to getting the entire match. You can specify groups within a regular expression by using parentheses.

In [115]:
text = '123-4567 is my telephone.'

In [116]:
 re.findall(r'[\d]{3}-[\d]{4}', text)

['123-4567']

In [117]:
m = re.search(r'([\d]{3})-([\d]{4})', text)

In [118]:
 print(m)

<re.Match object; span=(0, 8), match='123-4567'>


In [119]:
m.group()

'123-4567'

In [120]:
 m.groups()

('123', '4567')

In [121]:
 m.group(1)

'123'

In [122]:
m.group(2)

'4567'

In [123]:
m = re.search(r'(?P<first3>[\d]{3})-(?P<last4>[\d]{4})', text)

In [124]:
 m.group('first3')

'123'

In [125]:
m.group('last4')

'4567'
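
Groups are also visible through a few other parts of the `re` API. The sketch below, using the same telephone text, shows `groupdict()`, how `findall` behaves when the pattern contains groups, and how a named group can be referenced in a replacement string with `\g<name>`:

```python
import re

text = '123-4567 is my telephone.'
pat = r'(?P<first3>\d{3})-(?P<last4>\d{4})'

m = re.search(pat, text)
named = m.groupdict()
print(named)     # {'first3': '123', 'last4': '4567'}

# With groups in the pattern, findall returns the groups, not the whole match.
pairs = re.findall(r'(\d{3})-(\d{4})', text)
print(pairs)     # [('123', '4567')]

# \g<name> refers to a named group inside a replacement string.
swapped = re.sub(pat, r'\g<last4>-\g<first3>', text)
print(swapped)   # 4567-123 is my telephone.
```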

### Writing REs: A practical example -- Step by step

In [126]:
text = [ '123 456 7890', '(123) 456 7890', '123-456-7890']

In [127]:
pat = r'\(?\d{3}\)?\s\d{3}\s\d{4}'
for d in text:
    m = re.search(pat, d)
    if m:
        print(m.group())

123 456 7890
(123) 456 7890


In [128]:
 pat = r'\(\d{3}\)\s\d{3}\s\d{4}'

In [129]:
for d in text:
    m = re.search(pat, d)
    if m:
        print(m.group())

(123) 456 7890


In [130]:
text = [ '123 456 7890', '(123) 456 7890', '123-456-7890', '123.456.7890']
pat = r'\(?\d{3}\)?[\s\-]\d{3}[\s\-]\d{4}'

for d in text:
    m = re.search(pat, d)
    if m:
        print(m.group())

123 456 7890
(123) 456 7890
123-456-7890


In [131]:
text = [ '123 456 7890', '(123) 456 7890', '123-456-7890', '123.456.7890']
pat = r'\(?\d{3}\)?[\s\-\.]\d{3}[\s\-\.]\d{4}'

for d in text:
    m = re.search(pat, d)
    if m:
        print(m.group())

123 456 7890
(123) 456 7890
123-456-7890
123.456.7890


In [132]:
text = [ '123 456 7890', '(123) 456 7890', '123-456-7890', '123.456.7890','1234567890']
pat = r'\(?\d{3}\)?[\s\-\.]?\d{3}[\s\-\.]?\d{4}'

for d in text:
    m = re.search(pat, d)
    if m:
        print(m.group())

123 456 7890
(123) 456 7890
123-456-7890
123.456.7890
1234567890


In [133]:
text = [ '1 123 456 7890', '+1 (123) 456 7890', '123-456-7890', '123.456.7890','1234567890']
pat = r'\d?\s?\(?\d{3}\)?[\s\-\.]?\d{3}[\s\-\.]?\d{4}'

for d in text:
    m = re.search(pat, d)
    if m:
        print(m.group())

1 123 456 7890
1 (123) 456 7890
123-456-7890
123.456.7890
1234567890


In [134]:
text = [ '1 123 456 7890', '+1 (123) 456 7890', '123-456-7890', '123.456.7890','1234567890']
pat = r'\+?\d?\s?\(?\d{3}\)?[\s\-\.]?\d{3}[\s\-\.]?\d{4}'

for d in text:
    m = re.search(pat, d)
    if m:
        print(m.group())

1 123 456 7890
+1 (123) 456 7890
123-456-7890
123.456.7890
1234567890


In [135]:
text = [ '1 123 456 7890', '+1 (123) 456 7890', '123-456-7890', '123.456.7890','1234567890']
pat = r'\+?\d?\s?\(?\d{3}\)?[\s\-\.]?\d{3}[\s\-\.]?\d{4}'
patc = re.compile(pat)

for d in text:
    m = re.search(patc, d)
    if m:
        print(m.group())

1 123 456 7890
+1 (123) 456 7890
123-456-7890
123.456.7890
1234567890
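
One thing to keep in mind with the pattern built up above: `re.search` reports a hit as soon as the pattern fits anywhere, so it also accepts strings that merely contain something phone-like. If the goal is to check that a whole string is a phone number, `fullmatch` is one option. A sketch with made-up inputs:

```python
import re

pat = r'\+?\d?\s?\(?\d{3}\)?[\s\-\.]?\d{3}[\s\-\.]?\d{4}'
patc = re.compile(pat)

# Made-up inputs: a clean number, one digit too many, and a number inside a sentence.
candidates = ['123-456-7890', '123-456-78901', 'call 123-456-7890 now']

results = []
for c in candidates:
    found = patc.search(c) is not None       # a match somewhere in the string
    valid = patc.fullmatch(c) is not None    # the whole string fits the pattern
    results.append((c, found, valid))
    print(repr(c), found, valid)
```

Only the first candidate passes both checks; the other two are found by `search` but rejected by `fullmatch`.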


### Grouping

In [136]:
text = ['1 123 456 7890', '+1 (123) 456 7890', '123-456-7890', '123.456.7890', '1234567890']
pat = r'(\+?\d?)\s?(\(?\d{3}\)?)[\s\-\.]?(\d{3})[\s\-\.]?(\d{4})'
patc = re.compile(pat)

for dt in text:
    m = re.search(patc, dt)
    if m:
        print(m.group(), "\t", m.group(1), "\t", m.group(2), "\t", m.group(3), "\t", m.group(4))

1 123 456 7890 	 1 	 123 	 456 	 7890
+1 (123) 456 7890 	 +1 	 (123) 	 456 	 7890
123-456-7890 	  	 123 	 456 	 7890
123.456.7890 	  	 123 	 456 	 7890
1234567890 	  	 123 	 456 	 7890


In [137]:
text = ['1 123 456 7890', '+1 (123) 456 7890', '123-456-7890', '123.456.7890', '1234567890']
pat = r'(?P<add1>\+?\d?)\s?(\(?\d{3}\)?)[\s\-\.]?(\d{3})[\s\-\.]?(\d{4})'
patc = re.compile(pat)

for dt in text:
    m = re.search(patc, dt)
    if m:
        print(m.group(), "\t", m.group(1), "\t", m.group(2), "\t", m.group(3), "\t", m.group(4))

1 123 456 7890 	 1 	 123 	 456 	 7890
+1 (123) 456 7890 	 +1 	 (123) 	 456 	 7890
123-456-7890 	  	 123 	 456 	 7890
123.456.7890 	  	 123 	 456 	 7890
1234567890 	  	 123 	 456 	 7890


### Naming Groups

In [138]:
text = ['1 123 456 7890', '+1 (123) 456 7890', '123-456-7890', '123.456.7890', '1234567890']
pat = r'(?P<add1>\+?\d?)\s?(?P<area>\(?\d{3}\)?)[\s\-\.]?(?P<first3>\d{3})[\s\-\.]?(?P<last4>\d{4})'
patc = re.compile(pat)

for dt in text:
    m = re.search(patc, dt)
    if m:
        print(m.group(), "\t", m.group('add1'),"\t", m.group('area'),"\t",
        m.group('first3'),"\t", m.group('last4') )

1 123 456 7890 	 1 	 123 	 456 	 7890
+1 (123) 456 7890 	 +1 	 (123) 	 456 	 7890
123-456-7890 	  	 123 	 456 	 7890
123.456.7890 	  	 123 	 456 	 7890
1234567890 	  	 123 	 456 	 7890
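
A pattern of this length is hard to read on one line. The `re.VERBOSE` flag from the Flags section allows the same pattern to be spread over several lines with comments. The layout and comments below are one possible arrangement, not the only one:

```python
import re

phone = re.compile(r"""
    (?P<add1>\+?\d?)        # optional country code digit, with optional +
    \s?                     # optional space
    (?P<area>\(?\d{3}\)?)   # area code, parentheses optional
    [\s\-\.]?               # optional separator
    (?P<first3>\d{3})       # first three digits
    [\s\-\.]?               # optional separator
    (?P<last4>\d{4})        # last four digits
""", re.VERBOSE)

m = phone.search('+1 (123) 456 7890')
parts = m.groupdict()
print(parts)
# {'add1': '+1', 'area': '(123)', 'first3': '456', 'last4': '7890'}
```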


#### Note: The course materials are developed mainly based on personal experience and contributions from the Python learning community
Reference books:
- Learning Python, 5th Edition by Mark Lutz
- Python Data Science Handbook by Jake VanderPlas
- Python for Data Analysis by Wes McKinney

Copyright ©2023 Mei Najim. All rights reserved.  