<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Searching-for-Patterns-in-Text" data-toc-modified-id="Searching-for-Patterns-in-Text-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Searching for Patterns in Text</a></span></li><li><span><a href="#Split-with-regular-expressions" data-toc-modified-id="Split-with-regular-expressions-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Split with regular expressions</a></span></li><li><span><a href="#Finding-All-Instances-Of-A-Pattern" data-toc-modified-id="Finding-All-Instances-Of-A-Pattern-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Finding All Instances Of A Pattern</a></span></li><li><span><a href="#Pattern-re-Syntax" data-toc-modified-id="Pattern-re-Syntax-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Pattern re Syntax</a></span><ul class="toc-item"><li><span><a href="#Repetition-Syntax" data-toc-modified-id="Repetition-Syntax-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Repetition Syntax</a></span></li><li><span><a href="#Character-Sets" data-toc-modified-id="Character-Sets-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Character Sets</a></span></li><li><span><a href="#Exclusion" data-toc-modified-id="Exclusion-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Exclusion</a></span></li><li><span><a href="#Character-Ranges" data-toc-modified-id="Character-Ranges-4.4"><span class="toc-item-num">4.4&nbsp;&nbsp;</span>Character Ranges</a></span></li><li><span><a href="#Escape-Codes" data-toc-modified-id="Escape-Codes-4.5"><span class="toc-item-num">4.5&nbsp;&nbsp;</span>Escape Codes</a></span></li></ul></li></ul></div>

# `re` module

- Regular expressions, often called 'regex' or 'regexp' in conversation, describe patterns to search for in text
- Their syntax includes rules for repetition, character matching, and much more
- Many parsing problems can be solved with regular expressions
- They're also a common interview question!

## Searching for Patterns in Text

In [1]:
import re

# List of patterns to search for
patterns = ['term1', 'term2']

# Text to parse
text = 'This is a string with term1, but it does not have the other term.'

for pattern_to_search in patterns:
    print('Searching for "{0}" in: "{1}"\n...'.format(pattern_to_search, text))

    # re.search(pattern, text) returns a Match object, or None if nothing matches
    match = re.search(pattern_to_search, text)
    if match:
        print('Match was found: ')
    else:
        print('No Match was found: ')
    print(match)
    print('\n')

Searching for "term1" in: "This is a string with term1, but it does not have the other term."
...
Match was found: 
<re.Match object; span=(22, 27), match='term1'>


Searching for "term2" in: "This is a string with term1, but it does not have the other term."
...
No Match was found: 
None




- `re.search()` takes the pattern and scans the text for the first location where it matches:
  - If a match is found, it returns a **`Match`** object
  - If no match is found, it returns **`None`**

In [2]:
# Pattern to search for
pattern = 'term1'

# Text to parse
text = 'This is a string with term1, but it does not have the other term.'

match = re.search(pattern, text)
print(type(match))

<class 're.Match'>


- The **Match** object returned by the `search()` method is more than just a Boolean flag
- It carries information about the match, including the original input string, the regular expression that was used, and the location of the match
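As a minimal sketch of the attributes mentioned above, the snippet below reads the matched text, its location, the original string, and the pattern back from a `Match` object. It reuses the same `text` and pattern as the earlier cells; the `is not None` guard is an added precaution, since `search()` returns `None` when nothing matches.

```python
import re

text = 'This is a string with term1, but it does not have the other term.'
match = re.search('term1', text)

# search() returns None on failure, so check before reading attributes
if match is not None:
    print(match.group())     # the substring that matched
    print(match.span())      # (start, end) positions as a tuple
    print(match.string)      # the original input string
    print(match.re.pattern)  # the pattern that was used
    # Slicing the input with start() and end() recovers the matched text
    print(text[match.start():match.end()])
```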

In [3]:
# help() prints the documentation itself and returns None, so print() is not needed
help(match)

# Show start of match
print(match.start())

# Show end
print(match.end())

Help on Match object:

class Match(builtins.object)
 |  The result of re.match() and re.search().
 |  Match objects always have a boolean value of True.
 |  
 |  Methods defined here:
 |  
 |  __copy__(self, /)
 |  
 |  __deepcopy__(self, memo, /)
 |  
 |  __getitem__(self, key, /)
 |      Return self[key].
 |  
 |  __repr__(self, /)
 |      Return repr(self).
 |  
 |  end(self, group=0, /)
 |      Return index of the end of the substring matched by group.
 |  
 |  expand(self, /, template)
 |      Return the string obtained by doing backslash substitution on the string template, as done by the sub() method.
 |  
 |  group(...)
 |      group([group1, ...]) -> str or tuple.
 |      Return subgroup(s) of the match by indices or names.
 |      For 0 returns the entire match.
 |  
 |  groupdict(self, /, default=None)
 |      Return a dictionary containing all the named subgroups of the match, keyed by the subgroup name.
 |      
 |      default
 |        Is used for groups that did not par

## Split with regular expressions

- `re.split()` splits the string wherever the pattern matches and returns the pieces as a list
- The term that was split on is removed from the result

In [4]:
# Term to split on
split_at_term = ' '  # Each space
phrase = 'What is the domain name of someone with the email: hello@gmail.com'

# Split the phrase
print(re.split(split_at_term, phrase))

['What', 'is', 'the', 'domain', 'name', 'of', 'someone', 'with', 'the', 'email:', 'hello@gmail.com']
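The cell above splits on a literal space, but the separator can be any pattern. The sketch below is one possible way to answer the question in `phrase`: it splits on runs of whitespace, then splits the last word at the `@` sign. The `fields` example and its input string are made up for illustration.

```python
import re

phrase = 'What is the domain name of someone with the email: hello@gmail.com'

# Split on runs of whitespace instead of on each single space
words = re.split(r'\s+', phrase)
email = words[-1]

# maxsplit=1 stops after the first '@', keeping the rest in one piece
user, domain = re.split('@', email, maxsplit=1)
print(user, domain)

# A character set splits on any one of several separators (made-up input)
fields = re.split(r'[,;]\s*', 'a, b;c,  d')
print(fields)
```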


## Finding All Instances Of A Pattern

- Use `re.findall()` to find all the instances of a pattern in a string

In [5]:
# Returns a list of all matches
cases = re.findall('match', 'test phrase match is in middle and match is close to the end.')
print(cases)

['match', 'match']


In [6]:
# Function to count the appearances of a specific text within a long string
def count_appearance_in_text(toFind, longText):
    return len(re.findall(toFind, longText))

toFind = 'match'
text = 'test phrase match is in middle and match is close to the end. And here are more match, match, match!! math!'
print(count_appearance_in_text(toFind, text))

5
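One caveat with counting this way: the text to find is interpreted as a pattern, so characters such as `+` or `.` take on their regex meaning. The sketch below shows a possible variant, with the hypothetical name `count_literal`, that uses `re.escape()` to treat the search text as plain characters. The sample sentence is made up for illustration.

```python
import re

def count_literal(to_find, long_text):
    """Count non-overlapping occurrences of to_find, treated as plain text."""
    return len(re.findall(re.escape(to_find), long_text))

sample = 'Is 1+1 equal to 2? Yes, 1+1 is 2.'

# As a pattern, '1+1' means "one or more 1s followed by a 1"
as_pattern = len(re.findall('1+1', sample))

# Escaped, the plus sign is matched literally
as_literal = count_literal('1+1', sample)

print(as_pattern, as_literal)
```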


## Pattern re Syntax

- We can use *`metacharacters`* along with `re` to find specific types of patterns

In [7]:
def multi_re_find(patterns, phrase):
    '''
    Takes in a list of regex patterns.
    Prints a list of all matches.
    '''
    for pattern in patterns:
        print('Searching the phrase using the re check: {0}'.format(pattern))
        print(re.findall(pattern, phrase))
        print('\n')

### Repetition Syntax

- There are five ways to express repetition in a pattern
  - A pattern followed by the meta-character `*` is repeated *zero or more times*
  - Replace the `*` with `+` and the pattern must appear *one or more times*
  - Using `?` means the pattern appears *zero or one time*
  - For a specific number of occurrences, use `{m}` after the pattern, where `m` is replaced with the number of times the pattern should repeat
  - Use `{m,n}` where `m` is the minimum number of repetitions and `n` is the maximum. Leaving out `n` (`{m,}`) means the pattern appears at least `m` times, with no maximum. Leaving out `m` (`{,n}`) means it appears at most `n` times, including zero

In [8]:
test_phrase = 'sdsd..sssddd...sdddsddd...dsdds...dsssss...sddddddd'
test_patterns = [
    'sd*',  # s followed by zero or more d's: [0, ...]
    'sd+',  # s followed by one or more d's: [1, ...]
    'sd?',  # s followed by zero or one d's: [0, 1]
    'sd{3}',  # s followed by exactly three d's: [3]
    'sd{2,3}',  # s followed by two to three d's: [2, 3]
    'sd{3,}',  # s followed by three or more d's: [3, ...]
    'sd{,3}',  # s followed by zero to three d's: [0, ..., 3]
]

multi_re_find(test_patterns, test_phrase)

Searching the phrase using the re check: sd*
['sd', 'sd', 's', 's', 'sddd', 'sddd', 'sddd', 'sdd', 's', 's', 's', 's', 's', 's', 'sddddddd']


Searching the phrase using the re check: sd+
['sd', 'sd', 'sddd', 'sddd', 'sddd', 'sdd', 'sddddddd']


Searching the phrase using the re check: sd?
['sd', 'sd', 's', 's', 'sd', 'sd', 'sd', 'sd', 's', 's', 's', 's', 's', 's', 'sd']


Searching the phrase using the re check: sd{3}
['sddd', 'sddd', 'sddd', 'sddd']


Searching the phrase using the re check: sd{2,3}
['sddd', 'sddd', 'sddd', 'sdd', 'sddd']


Searching the phrase using the re check: sd{3,}
['sddd', 'sddd', 'sddd', 'sddddddd']


Searching the phrase using the re check: sd{,3}
['sd', 'sd', 's', 's', 'sddd', 'sddd', 'sddd', 'sdd', 's', 's', 's', 's', 's', 's', 'sddd']
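`findall()` reports matches found anywhere inside the phrase, which can make it hard to see exactly which strings a quantifier accepts. As a complementary sketch, the snippet below uses `re.fullmatch()`, which succeeds only when the pattern covers the whole string, on a small made-up list of candidates.

```python
import re

# Made-up candidates: an s followed by zero to four d's
candidates = ['s', 'sd', 'sdd', 'sddd', 'sdddd']
patterns = ['sd*', 'sd+', 'sd?', 'sd{3}', 'sd{2,3}']

# re.fullmatch() succeeds only if the pattern consumes the entire string
accepted = {
    pattern: [c for c in candidates if re.fullmatch(pattern, c)]
    for pattern in patterns
}

for pattern, strings in accepted.items():
    print(pattern, strings)
```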




### Character Sets

- Used when you wish to match any one of a group of characters at a point in the input
- Brackets are used to construct character set inputs

In [9]:
test_phrase = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'
test_patterns = [
    '[sd]',  # either s or d
    's[sd]+',  # s followed by one or more s or d
]

multi_re_find(test_patterns, test_phrase)

Searching the phrase using the re check: [sd]
['s', 'd', 's', 'd', 's', 's', 's', 'd', 'd', 'd', 's', 'd', 'd', 'd', 's', 'd', 'd', 'd', 'd', 's', 'd', 's', 'd', 's', 's', 's', 's', 's', 's', 'd', 'd', 'd', 'd']


Searching the phrase using the re check: s[sd]+
['sdsd', 'sssddd', 'sdddsddd', 'sds', 'sssss', 'sdddd']




### Exclusion

- Use `^` as the first character inside the brackets to exclude the listed characters
- For example, `[^!.? ]` matches any single character that is not a `!`, `.`, `?`, or space
- Adding `+` matches runs of one or more such characters, which in practice extracts the words

In [10]:
test_phrase = 'This is a string! But it has punctuation. How can we remove them?'
print(re.findall('[^!.? ]+', test_phrase))

['This', 'is', 'a', 'string', 'But', 'it', 'has', 'punctuation', 'How', 'can', 'we', 'remove', 'them']
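The meaning of `^` depends on where it appears, which is easy to trip over. The sketch below contrasts three cases on the same phrase: first inside the brackets (exclusion), outside the brackets (start of the string), and inside the brackets but not first (a literal caret). The short string `'a^b!c'` is made up for illustration.

```python
import re

test_phrase = 'This is a string! But it has punctuation. How can we remove them?'

# First inside the brackets: exclude the listed characters
words = re.findall('[^!.? ]+', test_phrase)

# Joining the words gives the phrase with the punctuation removed
cleaned = ' '.join(words)

# Outside the brackets: anchor the match to the start of the string
first_word = re.findall('^[A-Z][a-z]+', test_phrase)

# Inside the brackets but not first: an ordinary caret character
literal_caret = re.findall('[!^]', 'a^b!c')

print(cleaned)
print(first_word, literal_caret)
```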


### Character Ranges

- As character sets grow larger, typing every character that should (or should not) match could become very tedious
- A more compact format using character ranges lets you define a character set to include all of the contiguous characters between a start and stop point
- The format used is `[start-end]`
- Common use cases are to search for a specific range of letters in the alphabet

In [11]:
test_phrase = 'ThiS is an exAmple senTence. Lets sEe if we can fInd soMe letters.'
test_patterns = [
    '[a-z]+',  # sequences of lower case letters
    '[A-Z]+',  # sequences of upper case letters
    '[a-zA-Z]+',  # sequences of lower or upper case letters
    '[A-Z][a-z]+',  # one upper case letter followed by lower case letters
]

multi_re_find(test_patterns, test_phrase)

Searching the phrase using the re check: [a-z]+
['hi', 'is', 'an', 'ex', 'mple', 'sen', 'ence', 'ets', 's', 'e', 'if', 'we', 'can', 'f', 'nd', 'so', 'e', 'letters']


Searching the phrase using the re check: [A-Z]+
['T', 'S', 'A', 'T', 'L', 'E', 'I', 'M']


Searching the phrase using the re check: [a-zA-Z]+
['ThiS', 'is', 'an', 'exAmple', 'senTence', 'Lets', 'sEe', 'if', 'we', 'can', 'fInd', 'soMe', 'letters']


Searching the phrase using the re check: [A-Z][a-z]+
['Thi', 'Ample', 'Tence', 'Lets', 'Ee', 'Ind', 'Me']
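Ranges are not limited to letters, and several can be combined in one set. The sketch below, using a made-up `line` of text, picks out digit runs and hexadecimal color codes. It also shows that a hyphen placed last in the set is treated as a literal character instead of a range separator.

```python
import re

# Made-up input for illustration
line = 'color=#1A2b3C; fallback=#zzz; id=42'

# Digits only
digit_runs = re.findall('[0-9]+', line)

# A '#' followed by hexadecimal characters: three ranges in one set
hex_codes = re.findall('#[0-9a-fA-F]+', line)

# A trailing hyphen in the set matches a literal '-'
hyphenated = re.findall('[a-z-]+', 'well-known fact')

print(digit_runs, hex_codes, hyphenated)
```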




### Escape Codes

- Use special escape codes to find specific types of patterns in your data, such as digits, non-digits, whitespace, and more
- Escape codes are indicated by prefixing the character with a backslash (`\`)
- A backslash must itself be escaped in normal Python strings, which results in expressions that are difficult to read
- Creating regular expressions from raw strings, written by prefixing the literal value with `r`, eliminates this problem and maintains readability

<table border="1" class="docutils">
  <colgroup>
    <col width="14%" />
    <col width="86%" />
  </colgroup>
  <thead valign="bottom">
    <tr class="row-odd"><th class="head">Code</th>
      <th class="head">Meaning</th>
    </tr>
  </thead>
  <tbody valign="top">
    <tr class="row-even"><td><tt class="docutils literal"><span class="pre">\d</span></tt></td>
      <td>a digit</td>
    </tr>
    <tr class="row-odd"><td><tt class="docutils literal"><span class="pre">\D</span></tt></td>
      <td>a non-digit</td>
    </tr>
    <tr class="row-even"><td><tt class="docutils literal"><span class="pre">\s</span></tt></td>
      <td>whitespace (tab, space, newline, etc.)</td>
    </tr>
    <tr class="row-odd"><td><tt class="docutils literal"><span class="pre">\S</span></tt></td>
      <td>non-whitespace</td>
    </tr>
    <tr class="row-even"><td><tt class="docutils literal"><span class="pre">\w</span></tt></td>
      <td>alphanumeric (letters, digits, and the underscore)</td>
    </tr>
    <tr class="row-odd"><td><tt class="docutils literal"><span class="pre">\W</span></tt></td>
      <td>non-alphanumeric (anything other than letters, digits, and the underscore)</td>
    </tr>
  </tbody>
</table>
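The raw-string point can be checked directly. The sketch below compares a normal and a raw literal for the same pattern, then shows a case where omitting the `r` changes the result. It uses `\b`, the word-boundary escape, which is not listed in the table above and is chosen here only because Python also reads `'\b'` as a backspace character in normal strings. The `sample` string is made up.

```python
import re

# Both literals contain the same three characters: backslash, d, plus
same_pattern = ('\\d+' == r'\d+')

# Without the r prefix, '\b' is a single backspace character,
# not the two-character regex escape for a word boundary
plain_length = len('\b')
raw_length = len(r'\b')

sample = 'term1 and term'
with_raw = re.findall(r'\bterm\b', sample)    # whole word 'term' only
without_raw = re.findall('\bterm\b', sample)  # looks for backspace characters

print(same_pattern, plain_length, raw_length)
print(with_raw, without_raw)
```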

In [12]:
test_phrase = 'This is a string with some numbers 1233 and a symbol #hashtag'
test_patterns = [
    r'\d+',  # sequence of digits
    r'\D+',  # sequence of non-digits
    r'\s+',  # sequence of whitespace
    r'\S+',  # sequence of non-whitespace
    r'\w+',  # sequence of alphanumeric characters
    r'\W+',  # sequence of non-alphanumeric characters
]

multi_re_find(test_patterns, test_phrase)

Searching the phrase using the re check: \d+
['1233']


Searching the phrase using the re check: \D+
['This is a string with some numbers ', ' and a symbol #hashtag']


Searching the phrase using the re check: \s+
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']


Searching the phrase using the re check: \S+
['This', 'is', 'a', 'string', 'with', 'some', 'numbers', '1233', 'and', 'a', 'symbol', '#hashtag']


Searching the phrase using the re check: \w+
['This', 'is', 'a', 'string', 'with', 'some', 'numbers', '1233', 'and', 'a', 'symbol', 'hashtag']


Searching the phrase using the re check: \W+
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' #']


