<h1 align="center">REGULAR EXPRESSIONS</h1>
<h2 align="left"><ins>Lesson Guide</ins></h2>

- [**`re` MODULE PATTERN SYNTAX**](#syntax)
    - [**Repetition Syntax**](#repetition)
    - [**Character Sets**](#sets)
    - [**Exclusion**](#exclusion)
    - [**Character Ranges**](#ranges)
    - [**Escape Codes**](#escapes)
- [**SEARCHING FOR PATTERNS IN TEXT**](#text)
    - [**`match()`**](#match)
    - [**`search()`**](#search)
    - [**`finditer()`**](#finditer)
    - [**`compile()`**](#compile)
    - [**`fullmatch()`**](#fullmatch)
    - [**`findall()`**](#findall)
    - [**`sub()`**](#sub)
    - [**`subn()`**](#subn)
    - [**OR | in regex**](#or)
- [**SPLIT WITH REGULAR EXPRESSIONS**](#split)
- [**MORE EXAMPLES**](#examples)


### Documentation
https://docs.python.org/3/library/re.html<br> 
https://docs.python.org/3/howto/regex.html

<ins>**Additional Resources**</ins><br>
https://www.tutorialspoint.com/python3/python_reg_expressions.htm<br>
[Tutorial website - RegexOne](https://regexone.com/)<br>
[Regex101 (personal favourite)](https://regex101.com/)

Regular expressions are text-matching patterns described with a formal syntax. You'll often hear regular expressions referred to as 'regex' or 'regexp' in conversation. Regular expressions can include a variety of rules, from finding repetition, to text-matching, and much more. As you advance in Python you'll see that a lot of your parsing problems can be solved with regular expressions (they're also a common interview question).

||||||
|-:|:-|:-:|-:|:-|
|abc…|Letters||{m}|m Repetitions|
|123…|Digits||{m,n}|m to n Repetitions|
|\d|Any Digit||*|Zero or more repetitions|
|\D|Any Non-digit character||+|One or more repetitions|
|.|Any Character||?|Optional character|
|\\.|Period||\s|Any Whitespace|
|[abc]|Only a, b, or c||\S|Any Non-whitespace character|
|[^abc]|Not a, b, nor c||^…$|Starts and ends|
|[a-z]|Characters a to z||(…)|Capture Group|
|[0-9]|Numbers 0 to 9||(a(bc))|Capture Sub-group|
|\w|Any Alphanumeric character||(.*)|Capture all|
|\W|Any Non-alphanumeric character||(abc\|def)|Matches abc or def|

If you're familiar with Perl, you'll notice that the syntax for regular expressions is very similar in Python. We will be using the <code>re</code> module with Python.

In [1]:
import re

In [2]:
print(dir(re))

['A', 'ASCII', 'DEBUG', 'DOTALL', 'I', 'IGNORECASE', 'L', 'LOCALE', 'M', 'MULTILINE', 'Match', 'Pattern', 'RegexFlag', 'S', 'Scanner', 'T', 'TEMPLATE', 'U', 'UNICODE', 'VERBOSE', 'X', '_MAXCACHE', '__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', '__version__', '_cache', '_compile', '_compile_repl', '_expand', '_locale', '_pickle', '_special_chars_map', '_subx', 'compile', 'copyreg', 'enum', 'error', 'escape', 'findall', 'finditer', 'fullmatch', 'functools', 'match', 'purge', 'search', 'split', 'sre_compile', 'sre_parse', 'sub', 'subn', 'template']


<a id='syntax'></a>
## `re` MODULE PATTERN SYNTAX
Regular expressions support a huge variety of patterns beyond just simply finding where a single string occurred. We can use *metacharacters* along with `re` to find specific types of patterns. 

Since we will be testing multiple `re` syntax forms, let's create a function that prints the results given a dictionary of regular expressions (each mapped to a short explanation) and a phrase to parse:

In [3]:
def multi_re_find(patterns,phrase):
    '''
    Takes in a dictionary mapping regex patterns to explanations
    Prints a list of all matches for each pattern
    '''
    for pattern, explanation in patterns.items():
        print(f'Searching the phrase with re = {pattern} ({explanation}).')
        print(f'Using the findall method, the list of patterns found in the phrase are:\n{re.findall(pattern,phrase)}\n')

<a id='repetition'></a>
### <ins>Repetition Syntax</ins>

There are five ways to express repetition in a pattern:

   1. A pattern followed by the meta-character <code>*</code> is repeated zero or more times. 
   2. Replace the <code>*</code> with <code>+</code> and the pattern must appear at least once. 
   3. Using <code>?</code> means the pattern appears zero or one time. 
   4. For a specific number of occurrences, use <code>{m}</code> after the pattern, where **m** is replaced with the number of times the pattern should repeat. 
   5. Use <code>{m,n}</code> where **m** is the minimum number of repetitions and **n** is the maximum. Leaving out **n** <code>{m,}</code> means the value appears at least **m** times, with no maximum.
    
Now we will see an example of each of these using our multi_re_find function:

In [4]:
test_phrase = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'

test_patterns = {'sd*':"s followed by zero or more d's",
                 'sd+':"s followed by one or more d's",
                 'sd?':"s followed by zero or one d's",
                 'sd{3}':"s followed by three d's",
                 'sd{2,3}':"s followed by two to three d's"}

multi_re_find(test_patterns,test_phrase)

Searching the phrase with re = sd* (s followed by zero or more d's).
Using the findall method, the list of patterns found in the phrase are:
['sd', 'sd', 's', 's', 'sddd', 'sddd', 'sddd', 'sd', 's', 's', 's', 's', 's', 's', 'sdddd']

Searching the phrase with re = sd+ (s followed by one or more d's).
Using the findall method, the list of patterns found in the phrase are:
['sd', 'sd', 'sddd', 'sddd', 'sddd', 'sd', 'sdddd']

Searching the phrase with re = sd? (s followed by zero or one d's).
Using the findall method, the list of patterns found in the phrase are:
['sd', 'sd', 's', 's', 'sd', 'sd', 'sd', 'sd', 's', 's', 's', 's', 's', 's', 'sd']

Searching the phrase with re = sd{3} (s followed by three d's).
Using the findall method, the list of patterns found in the phrase are:
['sddd', 'sddd', 'sddd', 'sddd']

Searching the phrase with re = sd{2,3} (s followed by two to three d's).
Using the findall method, the list of patterns found in the phrase are:
['sddd', 'sddd', 'sddd', 'sddd']
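The list above also mentions the open-ended form <code>{m,}</code>, which none of the patterns in the cell exercise. A minimal sketch, reusing the same test phrase (the variable name `at_least_two` is only illustrative):

```python
import re

test_phrase = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'

# 'sd{2,}' means: s followed by at least two d's, with no upper limit
at_least_two = re.findall(r'sd{2,}', test_phrase)
print(at_least_two)
# ['sddd', 'sddd', 'sddd', 'sdddd']
```

Unlike `sd{2,3}`, the last match keeps all four d's because there is no maximum.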



<a id='sets'></a>
### <ins>Character Sets</ins>
Character sets are used when you wish to match any one of a group of characters at a point in the input. Brackets are used to construct character set inputs. For example: the input <code>[ab]</code> searches for occurrences of either **a** or **b**.

In [5]:
test_phrase = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'

test_patterns = {'[sd]':"either the letter s or d",
                 's[sd]+':"s followed by one or more s or d letters"} 

multi_re_find(test_patterns,test_phrase)

Searching the phrase with re = [sd] (either the letter s or d).
Using the findall method, the list of patterns found in the phrase are:
['s', 'd', 's', 'd', 's', 's', 's', 'd', 'd', 'd', 's', 'd', 'd', 'd', 's', 'd', 'd', 'd', 'd', 's', 'd', 's', 'd', 's', 's', 's', 's', 's', 's', 'd', 'd', 'd', 'd']

Searching the phrase with re = s[sd]+ (s followed by one or more s or d letters).
Using the findall method, the list of patterns found in the phrase are:
['sdsd', 'sssddd', 'sdddsddd', 'sds', 'sssss', 'sdddd']



It makes sense that the first input <code>[sd]</code> returns every instance of s or d. Also, the second input <code>s[sd]+</code> returns any full strings that begin with an s and continue with s or d characters until another character is reached.

In [6]:
test_phrase = "I enjoy learning programming languages such as Python 3"

test_patterns = {r"[a-z]":"any lower case letter in the alphabet", 
                 r"[A-Z]":"any upper case letter in the alphabet", 
                 r"[a-d]":"any of the letters a, b, c or d", 
                 r"[abn]":"only the letters a, b or n", 
                 r"[^a]":"anything that is not the letter a",
                 r"[0-9]":"any digit", 
                 r"[1-3]":"only the digits 1, 2 or 3", 
                 r"[135]":"only the digits 1, 3 or 5", 
                 r"[^3]":"anything that is not the digit 3"}

multi_re_find(test_patterns,test_phrase)

Searching the phrase with re = [a-z] (any lower case letter in the alphabet).
Using the findall method, the list of patterns found in the phrase are:
['e', 'n', 'j', 'o', 'y', 'l', 'e', 'a', 'r', 'n', 'i', 'n', 'g', 'p', 'r', 'o', 'g', 'r', 'a', 'm', 'm', 'i', 'n', 'g', 'l', 'a', 'n', 'g', 'u', 'a', 'g', 'e', 's', 's', 'u', 'c', 'h', 'a', 's', 'y', 't', 'h', 'o', 'n']

Searching the phrase with re = [A-Z] (any upper case letter in the alphabet).
Using the findall method, the list of patterns found in the phrase are:
['I', 'P']

Searching the phrase with re = [a-d] (any of the letters a, b, c or d).
Using the findall method, the list of patterns found in the phrase are:
['a', 'a', 'a', 'a', 'c', 'a']

Searching the phrase with re = [abn] (only the letters a, b or n).
Using the findall method, the list of patterns found in the phrase are:
['n', 'a', 'n', 'n', 'a', 'n', 'a', 'n', 'a', 'a', 'n']

Searching the phrase with re = [^a] (anything that is not the letter a).
Using the findall met

<a id='exclusion'></a>
### <ins>Exclusion</ins>
We can use <code>^</code> to exclude terms by incorporating it into the bracket syntax notation. For example: <code>[^...]</code> will match any single character not in the brackets. Let's see some examples:

In [7]:
test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

Use <code>[^!.? ]</code> to match any character that is not <code>!</code>, <code>.</code>, <code>?</code>, or a space. Add a <code>+</code> to require one or more such characters in a row. This effectively finds the words.

In [8]:
test = re.findall('[^!.? ]+',test_phrase)
print(test)

['This', 'is', 'a', 'string', 'But', 'it', 'has', 'punctuation', 'How', 'can', 'we', 'remove', 'it']


In [9]:
' '.join(test)

'This is a string But it has punctuation How can we remove it'
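An alternative to finding the words and joining them back together is to delete the unwanted characters directly. A hedged sketch using `re.sub()` with a character set; treating only `!`, `.` and `?` as punctuation is an assumption carried over from the example above:

```python
import re

test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

# Replace every !, . or ? with an empty string; spaces are left untouched
cleaned = re.sub(r'[!.?]', '', test_phrase)
print(cleaned)
# This is a string But it has punctuation How can we remove it
```

Inside brackets, `.` and `?` lose their special meaning, so they do not need to be escaped here.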

<a id='ranges'></a>
### <ins>Character Ranges</ins>
As character sets grow larger, typing every character that should (or should not) match could become very tedious. A more compact format using character ranges lets you define a character set to include all of the contiguous characters between a start and stop point. The format used is <code>[start-end]</code>.

Common use cases are to search for a specific range of letters in the alphabet. For instance, <code>[a-f]</code> would return matches with any occurrence of letters between a and f. 

In [10]:
test_phrase = 'This is an example sentence. Lets see if we can find some letters.'

test_patterns={'[a-z]+':"one or more occurrences of lower case letters",
               '[A-Z]+':"one or more occurrences of upper case letters",
               '[a-zA-Z]+':"one or more occurrences of lower or upper case letters",
               '[A-Z][a-z]+':"one upper case letter, followed by one or more occurrences of lower case letters"} 

multi_re_find(test_patterns,test_phrase)

Searching the phrase with re = [a-z]+ (one or more occurrences of lower case letters).
Using the findall method, the list of patterns found in the phrase are:
['his', 'is', 'an', 'example', 'sentence', 'ets', 'see', 'if', 'we', 'can', 'find', 'some', 'letters']

Searching the phrase with re = [A-Z]+ (one or more occurrences of upper case letters).
Using the findall method, the list of patterns found in the phrase are:
['T', 'L']

Searching the phrase with re = [a-zA-Z]+ (one or more occurrences of lower or upper case letters).
Using the findall method, the list of patterns found in the phrase are:
['This', 'is', 'an', 'example', 'sentence', 'Lets', 'see', 'if', 'we', 'can', 'find', 'some', 'letters']

Searching the phrase with re = [A-Z][a-z]+ (one upper case letter, followed by one or more occurrences of lower case letters).
Using the findall method, the list of patterns found in the phrase are:
['This', 'Lets']



<a id='escapes'></a>
### <ins>Escape Codes</ins>

You can use special escape codes to find specific types of patterns in your data, such as digits, non-digits, whitespace, and more. For example:

<table border="1" class="docutils">
<colgroup>
<col width="14%" />
<col width="86%" />
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Code</th>
<th class="head">Meaning</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td><tt class="docutils literal"><span class="pre">\d</span></tt></td>
<td>a digit</td>
</tr>
<tr class="row-odd"><td><tt class="docutils literal"><span class="pre">\D</span></tt></td>
<td>a non-digit</td>
</tr>
<tr class="row-even"><td><tt class="docutils literal"><span class="pre">\s</span></tt></td>
<td>whitespace (tab, space, newline, etc.)</td>
</tr>
<tr class="row-odd"><td><tt class="docutils literal"><span class="pre">\S</span></tt></td>
<td>non-whitespace</td>
</tr>
<tr class="row-even"><td><tt class="docutils literal"><span class="pre">\w</span></tt></td>
<td>alphanumeric (letters, digits, and the underscore)</td>
</tr>
<tr class="row-odd"><td><tt class="docutils literal"><span class="pre">\W</span></tt></td>
<td>non-alphanumeric (anything other than letters, digits, and the underscore)</td>
</tr>
</tbody>
</table>

Escapes are indicated by prefixing the character with a backslash <code>\\</code>. Unfortunately, a backslash must itself be escaped in normal Python strings, and that results in expressions that are difficult to read. Using raw strings, created by prefixing the literal value with <code>r</code>, eliminates this problem and maintains readability.
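To make the raw-string point concrete, here is a small sketch comparing the two spellings. The `\b` word-boundary escape is not covered in this lesson; it is used only because it shows most clearly what goes wrong without the <code>r</code> prefix:

```python
import re

# Both literals describe the same two-character pattern: backslash + d
escaped = '\\d+'
raw = r'\d+'
print(escaped == raw)          # True

# '\b' in a normal string is a single backspace character;
# r'\b' is two characters, which re reads as a word boundary
print(len('\b'), len(r'\b'))   # 1 2

whole_words = re.findall(r'\bcat\b', 'cat concat cat')
print(whole_words)             # ['cat', 'cat']
```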

In [11]:
test_phrase = 'This is a string with some numbers 1233 and a symbol #hashtag.'

test_patterns = {r'\d+':"one or more occurrences of digits",
                 r'\D+':"one or more occurrences of non-digits",
                 r'\s+':"one or more occurrences of whitespace",
                 r'\S+':"one or more occurrences of non-whitespace",
                 r'\w+':"one or more occurrences of alphanumeric characters",
                 r'\W+':"one or more occurrences of non-alphanumeric characters"}

multi_re_find(test_patterns,test_phrase)

Searching the phrase with re = \d+ (one or more occurrences of digits).
Using the findall method, the list of patterns found in the phrase are:
['1233']

Searching the phrase with re = \D+ (one or more occurrences of non-digits).
Using the findall method, the list of patterns found in the phrase are:
['This is a string with some numbers ', ' and a symbol #hashtag.']

Searching the phrase with re = \s+ (one or more occurrences of whitespace).
Using the findall method, the list of patterns found in the phrase are:
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']

Searching the phrase with re = \S+ (one or more occurrences of non-whitespace).
Using the findall method, the list of patterns found in the phrase are:
['This', 'is', 'a', 'string', 'with', 'some', 'numbers', '1233', 'and', 'a', 'symbol', '#hashtag.']

Searching the phrase with re = \w+ (one or more occurrences of alphanumeric characters).
Using the findall method, the list of patterns found in the phrase are:
['This', 'is',

<a id='text'></a>
## SEARCHING FOR PATTERNS IN TEXT

<a id='match'></a>
### <ins>`match()`</ins>

```python
re.match(pattern, string, flags=0)
```

Docstring:<br>
**Try to apply the pattern at the start of the string**, returning a Match object, or None if no match was found.

In [12]:
mystr = "You can learn any programming language, whether it is Python2, Python3, Perl, Java, javascript or PHP"

In [13]:
a = re.match('You', mystr)

print(a)    # output illustrates the match is found at index positions 0,1,2
print(a.group())

<re.Match object; span=(0, 3), match='You'>
You


In [14]:
a = re.match('abc', mystr)
print(a)
print(a.group())

None


AttributeError: 'NoneType' object has no attribute 'group'
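The error above occurs because `re.match()` returned `None`, and `None` has no `.group()` method. One common way to avoid it, sketched here with an illustrative `result` variable, is to test the return value before using it:

```python
import re

mystr = "You can learn any programming language, whether it is Python2, Python3, Perl, Java, javascript or PHP"

a = re.match('abc', mystr)

# re.match() returns None on failure, so check the result before calling .group()
if a is not None:
    result = a.group()
else:
    result = 'no match'

print(result)
# no match
```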

In [15]:
# match() also accepts flags

a = re.match('you', mystr, re.I)   # re.I (re.IGNORECASE) makes the match case-insensitive
print(a)    # output illustrates the match is found at positions 0,1,2
print(a.group())

<re.Match object; span=(0, 3), match='You'>
You
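The `flags` argument is not specific to `match()`; the other search functions in this lesson accept it too. A brief sketch with `findall()`, using the same string (the variable names are illustrative):

```python
import re

mystr = "You can learn any programming language, whether it is Python2, Python3, Perl, Java, javascript or PHP"

# re.I is the short form of re.IGNORECASE
case_sensitive = re.findall('java', mystr)
case_insensitive = re.findall('java', mystr, re.IGNORECASE)

print(case_sensitive)     # ['java']
print(case_insensitive)   # ['Java', 'java']
```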


<a id='search'></a>
### <ins>`search()`</ins>

```python
re.search(pattern, string, flags=0)
```

Docstring:<br>
**Scan through string looking for a match to the pattern**, returning a Match object, or None if no match was found.

One of the most common uses for the re module is for finding patterns in text.

In [16]:
# List of patterns to search for
patterns = ['term1', 'term2']

# Text to parse
text = 'This is a string with term1, but it does not have the other term.'

for pattern in patterns:
    print(f'Searching for "{pattern}" in:\n "{text}"\n')
    
    # Check for match
    if re.search(pattern,text):
        print('Match was found. \n')
    else:
        print('No Match was found.\n')

Searching for "term1" in:
 "This is a string with term1, but it does not have the other term."

Match was found. 

Searching for "term2" in:
 "This is a string with term1, but it does not have the other term."

No Match was found.



Now we've seen that <code>re.search()</code> will take the pattern, scan the text, and then return a **Match** object. If no match is found, **None** is returned. To give a clearer picture of this match object, check out the cell below:

In [17]:
# Pattern to search for
pattern = 'term1'

# Text to parse
text = 'This is a string with term1, but it does not have the other term.'

match = re.search(pattern,text)

print(type(match))
print(match)

<class 're.Match'>
<re.Match object; span=(22, 27), match='term1'>


The **Match** object returned by the search() method is more than just a Boolean or None: it contains information about the match, including the original input string, the regular expression that was used, and the location of the match. Let's look at the attributes and methods available on the match object:

In [18]:
print(match.string)
print(match.re)
print(match.regs)

This is a string with term1, but it does not have the other term.
re.compile('term1')
((22, 27),)


In [19]:
# Show start of match
print(match.start())

# Show end
print(match.end())

22
27


Let's look at another example.

In [20]:
arp = "22.22.22.1   0     b4:a9:5a:ff:c8:45 VLAN#222       L"

b = re.search(r"(.+?) +(\d) +(.+?)\s{2,}(\w)*", arp)
print(b)
print(b.group(1))
print(b.group(2))
print(b.group(3))
print(b.group(4))

<re.Match object; span=(0, 53), match='22.22.22.1   0     b4:a9:5a:ff:c8:45 VLAN#222    >
22.22.22.1
0
b4:a9:5a:ff:c8:45 VLAN#222
L


In [21]:
# Same pattern as above, but the first group is greedy: (.+) instead of (.+?)
b = re.search(r"(.+) +(\d) +(.+?)\s{2,}(\w)*", arp)
print(b)
print(b.group())

# Greedy (.+) grabs as much as it can, so group 1 keeps two of the trailing spaces
print(b.group(1))

# \d matches a single digit (0-9)
print(b.group(2))

# \s matches any whitespace character
# {2,} means two or more occurrences
print(b.group(3))

print(b.group(4))

<re.Match object; span=(0, 53), match='22.22.22.1   0     b4:a9:5a:ff:c8:45 VLAN#222    >
22.22.22.1   0     b4:a9:5a:ff:c8:45 VLAN#222       L
22.22.22.1  
0
b4:a9:5a:ff:c8:45 VLAN#222
L


In [22]:
print(b.group())
print(b.group(0))

22.22.22.1   0     b4:a9:5a:ff:c8:45 VLAN#222       L
22.22.22.1   0     b4:a9:5a:ff:c8:45 VLAN#222       L


In [23]:
# groups() returns all captured groups as a tuple, which can be used for unpacking
b.groups()

('22.22.22.1  ', '0', 'b4:a9:5a:ff:c8:45 VLAN#222', 'L')
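As a sketch of the unpacking mentioned in the comment above, the tuple can be assigned straight to individual variables. This uses the non-greedy pattern from the earlier cell so that the first group carries no trailing spaces; the variable names are illustrative guesses at what each column means, not something the source defines:

```python
import re

arp = "22.22.22.1   0     b4:a9:5a:ff:c8:45 VLAN#222       L"

b = re.search(r"(.+?) +(\d) +(.+?)\s{2,}(\w)*", arp)

# groups() returns one item per capture group, so it unpacks cleanly
# (hypothetical names chosen for readability)
ip, age, mac_and_vlan, flag = b.groups()

print(ip)             # 22.22.22.1
print(mac_and_vlan)   # b4:a9:5a:ff:c8:45 VLAN#222
```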

<a id='finditer'></a>
### <ins>`finditer()`</ins>

```python
re.finditer(pattern, string, flags=0)
```

Docstring:<br>
**Return an iterator over all non-overlapping matches in the string. For each match, the iterator returns a Match object.**

In [24]:
text = "The agent's phone number is 408-555-1234. Call soon!"

Using plain Python, we can perform a quick membership test:

In [25]:
'phone' in text

True

But let's show the format for regular expressions, because later on we will be searching for patterns that won't have such a simple solution.

In [26]:
# Using match() (which only checks the beginning of the string)

patterns = ['The', 'phone']

for pattern in patterns:
    if re.match(pattern,text):
        print(f'Using the search pattern {pattern}, Match found something - {re.match(pattern,text)}')
    else:
        print(f'No matches found using the pattern {pattern}.')

Using the search pattern The, Match found something - <re.Match object; span=(0, 3), match='The'>
No matches found using the pattern phone.


In [27]:
# Using search() (which checks the entire string)

patterns = ["NOT IN TEXT", 'phone']

for pattern in patterns:
    if re.search(pattern,text):
        print(f'Using the search pattern {pattern}, Search found something - {re.search(pattern,text)}')
    else:
        print(f'No matches found using the pattern {pattern}.')

No matches found using the pattern NOT IN TEXT.
Using the search pattern phone, Search found something - <re.Match object; span=(12, 17), match='phone'>


Let's take a closer look at this Match object.

In [28]:
pattern = 'phone'
match = re.search(pattern,text)
match

<re.Match object; span=(12, 17), match='phone'>

In [29]:
print(match.span())
print(match.start())
print(match.end())

(12, 17)
12
17


But what if the pattern occurs more than once?

In [30]:
text = "my phone is a new phone"
match = re.search("phone",text)
match.span()  # only returns the first occurrence

(3, 8)

Notice it only matches the first instance. If we wanted a list of all matches, we can use the `findall()` method.

In [31]:
matches = re.findall("phone",text)
print(matches)
print(len(matches))

['phone', 'phone']
2


To get actual match objects, use the iterator:

In [32]:
for match in re.finditer("phone",text):
    print(match)

<re.Match object; span=(3, 8), match='phone'>
<re.Match object; span=(18, 23), match='phone'>


In [33]:
for match in re.finditer("phone",text):
    print(match.span())

(3, 8)
(18, 23)


If you want the actual text that matched, use the `.group()` method (for a pattern without capture groups, this is the same text that `findall()` returns).

In [34]:
for match in re.finditer("phone",text):
    print(match.group())

phone
phone


The following was compiled using the code generator in https://regex101.com/.

In [35]:
regex = r"p[a-z]+"

test_str = "my phone is a new phone"

matches = re.finditer(regex, test_str, re.MULTILINE)

for match_num, match in enumerate(matches, start=1):
    print(f"Match {match_num} was found at {match.start()}-{match.end()}: {match.group()}")

    # Report each capture group, if the pattern defines any (this one does not)
    for group_num in range(1, len(match.groups()) + 1):
        print(f"Group {group_num} found at {match.start(group_num)}-{match.end(group_num)}: {match.group(group_num)}")

Match 1 was found at 3-8: phone
Match 2 was found at 18-23: phone


In [36]:
for match in re.finditer(r"p[a-z]+",text):
    print(match.group())

phone
phone


<a id='compile'></a>
### <ins>`compile()`</ins>

In [37]:
text = "My telephone number is 408-555-1234"

In [38]:
phone = re.search(r'\d\d\d-\d\d\d-\d\d\d\d',text)

print(phone.group())
print(phone.group(0))

408-555-1234
408-555-1234


Notice the repetition of \d. That is a bit of an annoyance, especially if we are looking for very long strings of numbers. Let's rewrite our pattern using quantifiers:

In [39]:
re.search(r'\d{3}-\d{3}-\d{4}',text)

<re.Match object; span=(23, 35), match='408-555-1234'>

What if we wanted to do two tasks: find phone numbers, and also quickly extract their area code (the first three digits)? We can use groups for any general task that involves grouping together regular expressions (so that we can later break them down). 

Using the phone number example, we can separate groups of regular expressions using parentheses:

In [40]:
phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')
results = re.search(phone_pattern,text)

In [41]:
# The entire result
results.group()

'408-555-1234'

In [42]:
# Can then also call by group position.
# Something to note is that group ordering starts at 1. Passing in 0 returns everything
results.group(1)

'408'

In [43]:
results.group(2)

'555'

In [44]:
results.group(3)

'1234'

In [45]:
# We only had three groups of parentheses
results.group(4)

IndexError: no such group
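The cells above pass the compiled pattern back into `re.search()`. A compiled pattern object also has its own `search()`, `findall()` and related methods, which is convenient when the same pattern is applied to several strings. A minimal sketch; apart from the first one, the sample strings are invented for this illustration:

```python
import re

# Compile once, then reuse the pattern object's own methods
phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')

texts = ["My telephone number is 408-555-1234",
         "Call 212-555-0000 or 415-555-9999",   # made-up sample
         "No number here"]                      # made-up sample

all_area_codes = []
for text in texts:
    # findall() on a pattern with several groups returns a list of tuples;
    # the first item of each tuple is the area code
    area_codes = [groups[0] for groups in phone_pattern.findall(text)]
    all_area_codes.append(area_codes)
    print(area_codes)
# ['408']
# ['212', '415']
# []
```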

Let's look at another example involving email addresses.

In [46]:
# email validation

pattern = re.compile(r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)")
string = 'deanw@gmail.com'
string2 = 'bob'

a = pattern.search(string)
b = pattern.search(string2)
print(a)
print(b)

<re.Match object; span=(0, 15), match='deanw@gmail.com'>
None


In [47]:
multi_re_find({r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)":"search for email address"},string)

Searching the phrase with re = (^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$) (search for email address).
Using the findall method, the list of patterns found in the phrase are:
['deanw@gmail.com']



In [48]:
multi_re_find({r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)":"search for email address"},string2)

Searching the phrase with re = (^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$) (search for email address).
Using the findall method, the list of patterns found in the phrase are:
[]



<a id='fullmatch'></a>
### <ins>`fullmatch()`</ins>

In [49]:
# password checker: at least 8 characters, only letters, digits and $%#@ allowed, must end with a digit

pattern2 = re.compile(r"[a-zA-Z0-9$%#@]{7,}\d$")
password = 'isdfuj$#dof54'
password2 = 'dtg%#$$dfg'

check = pattern2.fullmatch(password)
check2 = pattern2.fullmatch(password2)
print(check)
print(check2)

<re.Match object; span=(0, 13), match='isdfuj$#dof54'>
None
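`fullmatch()` succeeds only when the pattern matches the entire string, which is what makes it suitable for validation tasks like the one above. A short sketch contrasting it with `match()` and `search()`; the three-digit pattern and the candidate string are invented for the comparison:

```python
import re

pattern = re.compile(r"\d{3}")   # illustrative pattern: exactly three digits
candidate = "abc123def"

print(pattern.match(candidate))      # None: the string does not start with digits
print(pattern.search(candidate))     # finds '123' at span (3, 6)
print(pattern.fullmatch(candidate))  # None: the whole string is not three digits
print(pattern.fullmatch("123"))      # matches: the whole string is three digits
```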


In [50]:
multi_re_find({r"[a-zA-Z0-9$%#@]{7,}\d$":'password checker based on above rules'},password)

Searching the phrase with re = [a-zA-Z0-9$%#@]{7,}\d$ (password checker based on above rules).
Using the findall method, the list of patterns found in the phrase are:
['isdfuj$#dof54']



In [51]:
multi_re_find({r"[a-zA-Z0-9$%#@]{7,}\d$":'password checker based on above rules'},password2)

Searching the phrase with re = [a-zA-Z0-9$%#@]{7,}\d$ (password checker based on above rules).
Using the findall method, the list of patterns found in the phrase are:
[]



<a id='findall'></a>
### <ins>`findall()`</ins>

We can use <code>re.findall()</code> to find all the instances of a pattern in a string. For example:

`re.findall(pattern, string, flags=0)`

Docstring:<br>
**Return a list of all non-overlapping matches in the string.**

If one or more capturing groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.

Empty matches are included in the result.
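The three return shapes described above can be seen side by side in a small sketch. The input strings are invented for illustration, and the last result assumes Python 3.7 or later:

```python
import re

# One group: a list of strings (the group's text, not the whole match)
one_group = re.findall(r'(\d+)-\d+', '10-20 30-40')
print(one_group)      # ['10', '30']

# Two groups: a list of tuples
two_groups = re.findall(r'(\d+)-(\d+)', '10-20 30-40')
print(two_groups)     # [('10', '20'), ('30', '40')]

# A pattern that can match nothing also reports empty matches
with_empties = re.findall(r'\d*', 'a1b')
print(with_empties)   # ['', '1', '', '']
```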

In [52]:
# Returns a list of all matches
re.findall('match','test phrase match is in middle and a match at the end')

['match', 'match']

In [53]:
len(re.findall('match','test phrase match is in middle and a match at the end'))

2

In [54]:
arp = "22.22.22.1   0     b4:a9:5a:ff:c8:45 VLAN#222       L"

a = re.findall(r"\d\d\.\d{2}\.[0-9][0-9]\.[0-9]{1,3}",arp)

# \d\d - python should expect 2 consecutive digits
# \. - 'character escaping' - to match the actual dot in the string we need to use the \ since . on its own has a different meaning
# \d{2} - implies that the previous character (which in this case is a digit from 0-9) should occur 2 times only
# \. - same as before
# [0-9][0-9] - set of characters which defines a range of characters
# \.
# [0-9]{1,3} - this implies that the set of characters will occur between 1-3 times. 
a

['22.22.22.1']

In [55]:
# this will return a list of tuples
b = re.findall(r"(\d\d)\.(\d{2})\.([0-9][0-9])\.([0-9]{1,3})",arp)
b

[('22', '22', '22', '1')]

In [56]:
# suppose we have more than one ip address in the string:
arp = "22.22.22.1   0     b4:a9:5a:ff:c8:45 VLAN#222   10.10.10.10    L"

a = re.findall(r"\d\d\.\d{2}\.[0-9][0-9]\.[0-9]{1,3}",arp)
print(a)

b = re.findall(r"(\d\d)\.(\d{2})\.([0-9][0-9])\.([0-9]{1,3})",arp)
print(b)

['22.22.22.1', '10.10.10.10']
[('22', '22', '22', '1'), ('10', '10', '10', '10')]


<a id='sub'></a>
### <ins>`sub()`</ins>

```python
re.sub(pattern, repl, string, count=0, flags=0)
```

Docstring:<br>
**Return the string obtained by replacing the leftmost non-overlapping occurrences of the pattern in string by the replacement repl.** repl can be either a string or a callable; if a string, backslash escapes in it are processed. If it is a callable, it's passed the Match object and must return a replacement string to be used.

In [57]:
arp = "22.22.22.1   0     b4:a9:5a:ff:c8:45 VLAN#222       L"

# suppose we replace every digit with a 7:
a = re.sub(r"\d", '7', arp)
a

'77.77.77.7   7     b7:a7:7a:ff:c7:77 VLAN#777       L'
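The docstring also mentions a callable `repl`, backslash escapes in a string `repl`, and the `count` argument, none of which the cell above uses. A hedged sketch of all three; the sample sentence and the helper name `double` are invented for illustration:

```python
import re

# Sample sentence invented for this illustration
text = "3 apples and 12 pears"

def double(match):
    # A callable repl receives the Match object and returns the replacement string
    return str(int(match.group()) * 2)

doubled = re.sub(r"\d+", double, text)
print(doubled)    # 6 apples and 24 pears

# In a string repl, \1 and \2 refer back to the captured groups
swapped = re.sub(r"(\d+)-(\d+)", r"\2-\1", "10-20")
print(swapped)    # 20-10

# count limits how many replacements are made (0, the default, means all)
arp = "22.22.22.1   0     b4:a9:5a:ff:c8:45 VLAN#222       L"
first_two = re.sub(r"\d", "7", arp, count=2)
print(first_two[:10])   # 77.22.22.1
```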

<a id='subn'></a>
### <ins>`subn()`</ins>

```python
re.subn(pattern, repl, string, count=0, flags=0)
```

Docstring:<br>
**Return a 2-tuple containing (new_string, number).** `new_string` is the string obtained by replacing the leftmost non-overlapping occurrences of the pattern in the source string by the replacement repl. `number` is the number of substitutions that were made. `repl` can be either a string or a callable; if a string, backslash escapes in it are processed. If it is a callable, it's passed the Match object and must return a replacement string to be used.

In [58]:
a = "I enjoy learning programming languages such as Python 3"

re.subn(r"\s", "_", a)

# 8 refers to the number of changes made

('I_enjoy_learning_programming_languages_such_as_Python_3', 8)

<a id='or'></a>
### <ins>OR | in regex</ins>
Use the pipe operator <code>|</code> to write an or statement. For example:

In [59]:
re.search(r"man|woman","This man was here.")

<re.Match object; span=(5, 8), match='man'>

In [60]:
re.search(r"man|woman","This woman was here.")

<re.Match object; span=(5, 10), match='woman'>

In [61]:
a = "I enjoy learning programming languages such as Python 3"

result = re.search(r"\W(\w{20})\W|([A-Z]\w{5})\s\d", a)
print(result)
print(result.group())
print('*' * 50)
# first part - a sequence of 20 word characters with a non-word character on each side
#  or
# second part - a capital letter followed by 5 word characters, then a whitespace character and a digit

print(result.group(0))
print(result.group(1))
print(result.group(2))

<re.Match object; span=(47, 55), match='Python 3'>
Python 3
**************************************************
Python 3
None
Python
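One detail worth keeping in mind is that `|` splits the entire pattern unless parentheses limit its scope. A small sketch with a sentence similar to the ones above:

```python
import re

sentence = "This woman was here."

# Without a group, | splits the whole pattern: "This man" OR "woman"
ungrouped = re.search(r"This man|woman", sentence)
print(ungrouped.span(), ungrouped.group())   # (5, 10) woman

# Parentheses limit the alternation to the part that varies
grouped = re.search(r"This (man|woman) was", sentence)
print(grouped.group())    # This woman was
print(grouped.group(1))   # woman
```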


<a id='split'></a>
## SPLIT WITH REGULAR EXPRESSIONS

Let's see how we can split with the `re` syntax. This should look similar to how you used the split() method with strings in Python.

```python
re.split(pattern, string, maxsplit=0, flags=0)
```

Docstring:<br>
**Split the source string by the occurrences of the pattern, returning a list containing the resulting substrings.** If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list.

In [62]:
# Term to split on
split_term = '@'

phrase = 'What is the domain name of someone with the email: hello@gmail.com'

# Split the phrase
re.split(split_term,phrase)

['What is the domain name of someone with the email: hello', 'gmail.com']

Note how <code>re.split()</code> returns a list in which the term to split on has been removed and the remaining pieces of the string are the items.

In [63]:
a = "I enjoy learning programming languages such as Python 3"

# python method
print(a.split(" "))

# regex method
print(re.split(r" ", a))

['I', 'enjoy', 'learning', 'programming', 'languages', 'such', 'as', 'Python', '3']
['I', 'enjoy', 'learning', 'programming', 'languages', 'such', 'as', 'Python', '3']


In [64]:
# split the string on a 2-character word that has a non-word character (here, a space) on each side
re.split(r"\W\w{2}\W", a)

['I enjoy learning programming languages such', 'Python 3']
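The docstring for `re.split()` describes two behaviours that the cells above do not show: capturing parentheses keep the separator in the result, and `maxsplit` caps the number of splits. A minimal sketch of both (variable names are illustrative):

```python
import re

phrase = 'hello@gmail.com'

# Plain pattern: the separator is dropped
dropped = re.split(r'@', phrase)
print(dropped)    # ['hello', 'gmail.com']

# Capturing parentheses: the separator is kept in the result
kept = re.split(r'(@)', phrase)
print(kept)       # ['hello', '@', 'gmail.com']

a = "I enjoy learning programming languages such as Python 3"

# maxsplit=2 performs at most two splits; the rest stays in the last element
limited = re.split(r'\s', a, maxsplit=2)
print(limited)
# ['I', 'enjoy', 'learning programming languages such as Python 3']
```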

<a id='examples'></a>
## MORE EXAMPLES

In [65]:
my_regex_str = "200.10.2.0 255.255.255.0 200.20.5.2 1 205 T#1 S IB 5"

a = re.match(r"255", my_regex_str)
print(a)
type(a)

# this returns None because the match method matches patterns only at the beginning of the string.

None


NoneType

In [66]:
b = re.search(r"(.+?) +\d\d(\d)\.([0-9]{2,})\.([0-9]{1,3})\.(\d) +(.+)1 +(\d{3}) +(\w{1})#.+S(\s+)(\w)\w +(.*)", my_regex_str)

print(b)
print(b.group(0))   # group(0) is the entire match, which here spans the whole string

<re.Match object; span=(0, 52), match='200.10.2.0 255.255.255.0 200.20.5.2 1 205 T#1 S I>
200.10.2.0 255.255.255.0 200.20.5.2 1 205 T#1 S IB 5


In [67]:
print(b.group(1))

# Inside the first group, we're matching any character except a newline. The plus sign means that
# the previous expression, represented by the dot, may repeat one or more times. The match is made in a minimal
# fashion (non-greedy), matching as few characters as possible up to the first space character. The result is 
# the IP address.

200.10.2.0
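To see what the non-greedy `?` contributes, the sketch below runs a cut-down version of the pattern (just the first group followed by a single space) with and without it. The shortened patterns are for illustration only and are not the full expression used above:

```python
import re

my_regex_str = "200.10.2.0 255.255.255.0 200.20.5.2 1 205 T#1 S IB 5"

# Non-greedy: stop at the first space
lazy = re.search(r"(.+?) ", my_regex_str).group(1)

# Greedy: run on to the last space in the string
greedy = re.search(r"(.+) ", my_regex_str).group(1)

print(lazy)     # 200.10.2.0
print(greedy)   # 200.10.2.0 255.255.255.0 200.20.5.2 1 205 T#1 S IB
```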


In [68]:
# Challenge1 
regex_str = "123.456.789   0 PYTHON 3"

# method 1
regex = re.search(r"((\d+)\.(\d+)\.(\d+))\s{2,}", regex_str)

print(regex)
print(regex.group(1))

# method 2
regex = re.search(r"((\d\d\d)\.(\d+)\.(\d{1,3}))\s{2,}", regex_str)

print(regex)
print(regex.group(1))

# method 3
regex = re.search(r"(.+?)\s{2,}", regex_str)

print(regex)
print(regex.group(1))

# method 4 (greedy, so group 1 keeps one trailing space)
regex = re.search(r"(.+)\s{2,}", regex_str)

print(regex)
print(regex.group(1))

<re.Match object; span=(0, 14), match='123.456.789   '>
123.456.789
<re.Match object; span=(0, 14), match='123.456.789   '>
123.456.789
<re.Match object; span=(0, 14), match='123.456.789   '>
123.456.789
<re.Match object; span=(0, 14), match='123.456.789   '>
123.456.789 


In [69]:
# Challenge2 

# method 1
regex_str = "123.456.789   0 PYTHON 3"

regex = re.sub(r"\d|[A-Z]", "%", regex_str)

print(regex)

# method 2

regex_str = "123.456.789   0 PYTHON 3"

regex = re.sub(r"\w", "%", regex_str)

print(regex)

%%%.%%%.%%%   % %%%%%% %
%%%.%%%.%%%   % %%%%%% %
