## Regular Expressions (Regex)
credit: 
- https://realpython.com/regex-python/
- https://people.computing.clemson.edu/~jmarty/courses/LinuxStuff/Python%20Regular%20Expressions%20with%20Examples%20-%20Linux%20Tutorials%20-%20Learn%20Linux%20Configuration.pdf

Regular expressions (regex) are a powerful tool for matching text patterns. They can be used to search for specific text, replace parts of a string, validate input, and more. Most programming languages offer some form of regex support, and Python ships with a capable, easy-to-use regex library in its standard library.

Regex functionality in Python resides in a module named `re`.

In [1]:
import re

Once the module is imported, you can search for patterns in strings using the re.search() function. This function takes two arguments:

- The pattern that we are trying to match
- The string which we are searching

`re.search(<regex>, <string>)`
    
Scans a string for a regex match.

In [34]:
s = 'foo123bar'
re.search('123', s) #returns a match object, span=(3, 6) indicates the portion of <string> in which the match was found.

<re.Match object; span=(3, 6), match='123'>
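
The match object can be inspected further. The following is a minimal sketch using the same string as above; the variable names `m` and `no_match` are only illustrative.

```python
import re

s = 'foo123bar'
m = re.search('123', s)

# A match object is truthy; a failed search returns None, which is falsy.
if m:
    print(m.group())             # the matched text: 123
    print(m.start(), m.end())    # 3 6
    print(m.span())              # (3, 6)
    print(s[m.start():m.end()])  # slicing the original string gives 123 again

no_match = re.search('xyz', s)
print(no_match)                  # None
```

Because a failed search returns `None`, it is usually safest to test the result before calling methods on it.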

Cool! But the `in` operator and the `find` method of the `str` class already do the same thing, no?
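
For a fixed substring, yes. A small sketch of where the two approaches part ways (the second test string is made up for illustration, and the `[0-9]` notation is explained in the metacharacter table that follows):

```python
import re

s = 'foo123bar'

# Fixed substrings: plain string tools are enough.
print('123' in s)     # True
print(s.find('123'))  # 3

# "Any three digits in a row" is a pattern, not a fixed substring,
# so `in` and `find` cannot express it directly.
m = re.search('[0-9][0-9][0-9]', 'foo456bar')
print(m.group(), m.span())  # 456 (3, 6)
```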

### Python Regex Metacharacters

The real power of regex matching in Python emerges when `<regex>` contains special characters called metacharacters. These have a unique meaning to the regex matching engine and vastly enhance the capability of the search.

The following table briefly summarizes the most commonly used metacharacters and special sequences supported by the `re` module. Some characters serve more than one purpose:

| Character(s)  |	Meaning |
|---            |---         |
|.              |	Any character, except newline|
|^	            |   Start of string |
|$              |	End of a string |
|*              |	Any number of matches (0 or more)|
|+              |	1 or more matches|
|?              |	0 or 1 match|
|{n}            |	Exactly n matches|
|[a-c]          | One character of the selected range, in this case a,b,c |
|[A-Z]          | One character of the selected range, in this case A-Z |
|[0-9AF-Z]      | One character of the selected range, in this case 0-9, A, and F-Z |
|[^A-Za-z]      | A leading caret negates the set: one character outside the listed ranges; for example, ‘1’ would qualify |
| a\|d            | Alternation: matches the expression on either side, here ‘a’ or ‘d’ (for single characters, an alternative to using []) |
|\              | Escapes special characters, or introduces a special sequence such as \d |
|\s             | whitespace [ \t\n\r\f\v] (space, tab, newline, carriage return, form feed, vertical tab)|
|\d             | digits [0-9]          |
|\w             | “word” chars [a-zA-Z0-9_] |
|\S             | non-spaces [^ \t\n\r\f\v] |
|\D             | non-digits [^0-9]         |
|\W             | non-word chars [^a-zA-Z0-9_] |
|`(?P<name>...)` | Creates a named group called `name` |
|\b             | match to a word boundary |
|()             | Grouping|
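
Alternation and groups have no example in this section, so here is a brief sketch. The sample strings and the group names `key` and `value` are made up for illustration.

```python
import re

# Alternation: match either 'cat' or 'dog'.
alt = re.search(r'cat|dog', 'I have a dog')
print(alt.group())  # dog

# Numbered groups: parentheses capture parts of the match.
m = re.search(r'(\w+)@(\w+)', 'user@example')
print(m.group(0))   # user@example (the whole match)
print(m.group(1))   # user
print(m.group(2))   # example

# Named groups use the (?P<name>...) syntax.
named = re.search(r'(?P<key>\w+)=(?P<value>\d+)', 'width=42')
print(named.group('key'), named.group('value'))  # width 42
print(named.groupdict())  # {'key': 'width', 'value': '42'}
```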

### Simple Examples!

In [3]:
print(re.search('[a-z]', 'FOObar')) #  matches any lowercase alphabetic character between 'a' and 'z'

<re.Match object; span=(3, 4), match='b'>


In [4]:
print(re.search('[0-9][0-9]', 'foo123bar')) # matches a sequence of two digits

<re.Match object; span=(3, 5), match='12'>


In [5]:
print(re.search('[^0-9]', '12345foo')) # matches any character that isn’t a digit

<re.Match object; span=(5, 6), match='f'>


In [6]:
print(re.search(r'\w', '#(.a$@&')) # matches any word character; the r prefix (a raw string) is explained below

<re.Match object; span=(3, 4), match='a'>


In [7]:
print(re.search(r'\S', '  \n foo  \n  ')) # matches any character that isn’t whitespace

<re.Match object; span=(4, 5), match='f'>


In [8]:
print(re.search('.', 'foo.bar')) # matches the first character in the string 
print(re.search(r'\.', 'foo.bar')) # the backslash escapes the dot, so it is interpreted literally and matches the '.' at index 3

<re.Match object; span=(0, 1), match='f'>
<re.Match object; span=(3, 4), match='.'>


In [9]:
print(re.search('^foo', 'barfoo')) #'foo' must be present at the beginning

None


The `r` at the start of a pattern string marks a Python "raw" string, which passes backslashes through unchanged. This is very handy for regular expressions.

For example, in an ordinary Python string `\b` is the escape sequence for the backspace character. To use it as a word-boundary anchor in a regular expression, write the pattern as a raw string:

In [28]:
print(re.search(r'\bbar', 'foo bar'))
print(re.search(r'\bbar', 'foo.bar'))
print(re.search(r'\bbar', 'foobar'))
print(re.search(r'foo\b', 'foo bar'))
print(re.search(r'foo\b', 'foo.bar'))
print(re.search(r'foo\b', 'foobar'))
print(re.search(r'\bbar\b', 'foo bar baz'))
print(re.search(r'\bbar\b', 'foo(bar)baz'))
print(re.search(r'\bbar\b', 'foobarbaz'))

<re.Match object; span=(4, 7), match='bar'>
<re.Match object; span=(4, 7), match='bar'>
None
<re.Match object; span=(0, 3), match='foo'>
<re.Match object; span=(0, 3), match='foo'>
None
<re.Match object; span=(4, 7), match='bar'>
<re.Match object; span=(4, 7), match='bar'>
None
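
To see what the `r` prefix changes, compare the same pattern written with and without it. This is a small sketch; the variable names `plain` and `raw` are illustrative.

```python
import re

# Without the r prefix, Python turns '\b' into a single backspace character.
plain = '\bbar'
raw = r'\bbar'
print(len(plain), len(raw))  # 4 5

# The regex engine only sees a word boundary when the backslash survives.
print(re.search(plain, 'foo bar'))        # None: the text contains no backspace
print(re.search(raw, 'foo bar').group())  # bar

# Doubling the backslash is equivalent to using a raw string.
print('\\bbar' == raw)  # True
```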


In [36]:
print(re.search('foo-*bar', 'foobar'))
print(re.search('foo-*bar', 'foo-bar'))
print(re.search('foo-*bar', 'foo--bar'))

<re.Match object; span=(0, 6), match='foobar'>
<re.Match object; span=(0, 7), match='foo-bar'>
<re.Match object; span=(0, 8), match='foo--bar'>


In [12]:
print(re.search('foo-+bar', 'foobar'))
print(re.search('foo-+bar', 'foo--bar'))

None
<re.Match object; span=(0, 8), match='foo--bar'>


In [13]:
print(re.search('foo-?bar', 'foobar'))
print(re.search('foo-?bar', 'foo-bar'))
print(re.search('foo-?bar', 'foo--bar'))

<re.Match object; span=(0, 6), match='foobar'>
<re.Match object; span=(0, 7), match='foo-bar'>
None


In [14]:
print(re.match('foo[1-9]*bar', 'foo42bar'))
print(re.match('foo[1-9]?bar', 'foo42bar'))

<re.Match object; span=(0, 8), match='foo42bar'>
None


In [15]:
# greedy vs shortest possible match
print(re.search('<.*>', '%<foo> <bar> <baz>%')) # the quantifier metacharacters *, +, and ? are all greedy, meaning they produce the longest possible match
print(re.search('<.*?>', '%<foo> <bar> <baz>%')) # shortest possible match instead, using the non-greedy metacharacter sequence *?

<re.Match object; span=(1, 18), match='<foo> <bar> <baz>'>
<re.Match object; span=(1, 6), match='<foo>'>
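
The difference is easier to see when collecting every match. The sketch below reuses the string from the cell above and adds a negated character class as a common alternative to the non-greedy form (the alternative is an addition here, not something the cell above uses).

```python
import re

text = '%<foo> <bar> <baz>%'

greedy = re.findall(r'<.*>', text)       # one long match
lazy = re.findall(r'<.*?>', text)        # stops at the first '>'
negated = re.findall(r'<[^>]*>', text)   # "anything but '>'" gives the same result
print(greedy)   # ['<foo> <bar> <baz>']
print(lazy)     # ['<foo>', '<bar>', '<baz>']
print(negated)  # ['<foo>', '<bar>', '<baz>']

# +? and ?? are the non-greedy forms of + and ?.
print(re.search(r'ba+?', 'baaaa').group())  # ba
print(re.search(r'ba??', 'baaaa').group())  # b
```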


In [16]:
print(re.search('x-{3}x', 'x--x'))
print(re.search('x-{3}x', 'x---x'))
print(re.search('x-{3}x', 'x----x'))

None
<re.Match object; span=(0, 5), match='x---x'>
None


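A sequence with curly braces must fit one of the following patterns, in which m and n are nonnegative integers:

- {m} match exactly m times
- {m,n} match at least m times and at most n times
- {m,}  match at least m times
- {,n} match at most n times

Otherwise, the braces are matched literally. For example, `{3,5}` matches between three and five repetitions, taking as many as possible: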

In [37]:
print(re.search('a{3,5}', 'aaaaaaaa'))

<re.Match object; span=(0, 5), match='aaaaa'>
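
The remaining forms can be checked the same way. This is a sketch on the same string of eight `a` characters; the variable names are illustrative.

```python
import re

s = 'aaaaaaaa'  # eight 'a' characters

exact = re.search(r'a{3}', s).group()
at_least = re.search(r'a{3,}', s).group()
at_most = re.search(r'a{,2}', s).group()
lazy = re.search(r'a{3,5}?', s).group()  # a trailing ? makes the quantifier non-greedy
print(exact)     # aaa
print(at_least)  # aaaaaaaa
print(at_most)   # aa
print(lazy)      # aaa

# Braces that do not form a valid quantifier are matched literally.
print(re.search(r'x{}y', 'x{}y').group())        # x{}y
print(re.search(r'x{foo}y', 'x{foo}y').group())  # x{foo}y
```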


In [41]:
print(re.search('(bar)+', 'foo barbar baz'))
print(re.search('bar+', 'foo barbar baz'))

<re.Match object; span=(4, 10), match='barbar'>
<re.Match object; span=(4, 7), match='bar'>


In [47]:
print(re.search(r'\\', 'foo\\bar'))

<re.Match object; span=(3, 4), match='\\'>
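
When the text to be matched literally contains several metacharacters, escaping each one by hand gets tedious. The standard library's `re.escape()` does it for you. A minimal sketch with a made-up search string:

```python
import re

literal = '1+1=2'
text = 'is 1+1=2 true?'

# Unescaped, '+' is a quantifier: the pattern means "one or more '1', then '1=2'".
print(re.search(literal, text))          # None

# re.escape() backslash-escapes the metacharacters so the text matches literally.
escaped = re.escape(literal)
print(re.search(escaped, text).group())  # 1+1=2
```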


### Searching Functions

1. `re.search()`	Scans a string for a regex match (we have already seen this one). The function returns a match object if it finds a match and None otherwise

2. `re.match()` This is identical to re.search(), except that re.search() returns a match if `<regex>` matches anywhere in `<string>`, whereas re.match() returns a match only if `<regex>` matches at the beginning of `<string>`

Difference between `search` and `match`:

In [20]:
print(re.search(r'\d+', '123foobar'))
print(re.search(r'\d+', 'foo123bar'))

print(re.match(r'\d+', '123foobar'))
print(re.match(r'\d+', 'foo123bar'))

<re.Match object; span=(0, 3), match='123'>
<re.Match object; span=(3, 6), match='123'>
<re.Match object; span=(0, 3), match='123'>
None


3. `re.fullmatch()` This is similar to re.search() and re.match(), but re.fullmatch() returns a match only if `<regex>` matches `<string>` in its entirety

In [48]:
print(re.fullmatch(r'\d+', '123foo'))
print(re.fullmatch(r'\d+', 'foo123'))
print(re.fullmatch(r'\d+', 'foo123bar'))
print(re.fullmatch(r'\d+', '123'))
print(re.search(r'^\d+$', '123'))

None
None
None
<re.Match object; span=(0, 3), match='123'>
<re.Match object; span=(0, 3), match='123'>
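
`re.fullmatch()` is a natural fit for input validation. The sketch below uses a hypothetical helper, `is_valid_pin`, and a made-up rule (exactly four ASCII digits). It also shows one case where `^...$` with `re.search()` is more permissive than `re.fullmatch()`: `$` also matches just before a trailing newline.

```python
import re

def is_valid_pin(text):
    """Return True if text is exactly four ASCII digits (hypothetical rule)."""
    return re.fullmatch(r'[0-9]{4}', text) is not None

print(is_valid_pin('1234'))    # True
print(is_valid_pin('12345'))   # False: too long
print(is_valid_pin('12a4'))    # False: contains a letter
print(is_valid_pin('1234\n'))  # False: the trailing newline is not a digit

# By contrast, $ also matches just before a trailing newline.
print(re.search(r'^[0-9]{4}$', '1234\n') is not None)  # True
```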


4. `re.findall()` returns a list of all non-overlapping matches of `<regex>` in `<string>`. It scans the search string from left to right and returns all matches in the order found:

In [22]:
print(re.findall(r'\w+', '...foo,,,,bar:%$baz//|'))

['foo', 'bar', 'baz']


In [49]:
re.findall(r'(\w+),(\w+)', 'foo,bar,baz,qux,quux,corge')

[('foo', 'bar'), ('baz', 'qux'), ('quux', 'corge')]
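
What `re.findall()` returns depends on how many groups the pattern contains. When positions are needed as well, `re.finditer()` yields match objects instead of strings. A sketch on the same input (variable names are illustrative):

```python
import re

s = 'foo,bar,baz,qux,quux,corge'

# No groups: a list of whole matches.
print(re.findall(r'\w+,\w+', s))    # ['foo,bar', 'baz,qux', 'quux,corge']

# One group: a list of that group's text only.
print(re.findall(r'(\w+),\w+', s))  # ['foo', 'baz', 'quux']

# Two or more groups: a list of tuples, as in the cell above.
pairs = re.findall(r'(\w+),(\w+)', s)

# finditer() yields match objects, so spans are available too.
spans = [m.span() for m in re.finditer(r'(\w+),(\w+)', s)]
print(spans)  # [(0, 7), (8, 15), (16, 26)]
```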

### Substitution Function

`re.sub(<regex>, <repl>, <string>)` finds the leftmost non-overlapping occurrences of `<regex>` in `<string>`, replaces each match as indicated by `<repl>`, and returns the result. `<string>` remains unchanged.

In [51]:
s = 'foo.123.bar.789.baz'
print(re.sub(r'\d+', '#', s))

foo.#.bar.#.baz
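
`re.sub()` also accepts a `count` argument, backreferences in `<repl>`, and a function in place of a replacement string. The sketch below illustrates all three; the helper `double` is hypothetical.

```python
import re

s = 'foo.123.bar.789.baz'

# count limits how many matches are replaced.
first_only = re.sub(r'\d+', '#', s, count=1)
print(first_only)  # foo.#.bar.789.baz

# \1 and \2 in the replacement refer to the captured groups.
swapped = re.sub(r'(\w+),(\w+)', r'\2,\1', 'foo,bar')
print(swapped)     # bar,foo

# The replacement can be a function that receives each match object.
def double(match):
    return str(int(match.group()) * 2)

doubled = re.sub(r'\d+', double, s)
print(doubled)     # foo.246.bar.1578.baz
print(s)           # the original string is unchanged
```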
