# `re` module
---

**Table of Contents**<a id='toc0_'></a>    
- [Searching for Patterns in Text](#toc1_)    
- [Split With Regular Expressions](#toc2_)    
- [Finding All Instances Of A Pattern](#toc3_)    
- [Pattern `re` Syntax](#toc4_)    
  - [Repetition Syntax](#toc4_1_)    
  - [Character Sets](#toc4_2_)    
  - [Exclusion](#toc4_3_)    
  - [Character Ranges](#toc4_4_)    
  - [Escape Codes](#toc4_5_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=2
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

---

- Regular expressions, also called *regex* or *regexp*, describe patterns to search for in text
- They include rules for repetition, text matching, and much more
- Many parsing problems can be solved with regular expressions
- They are also a common interview topic

## <a id='toc1_'></a>Searching for Patterns in Text [&#8593;](#toc0_)

- **`re.search()`**
  - Takes a pattern and scans the text for the first location where it matches
  - Returns a **`Match`** object if the pattern is found
  - Returns **`None`** if the pattern is not found

In [1]:
import re

# List of patterns to search for
patterns: list[str] = ["term1", "term2"]

# Text to parse
text: str = "This is a string with term1, but it does not have the other term."

for pattern_to_search in patterns:
    print(f"Searching for {pattern_to_search} in: {text}\n...")

    # Run the search once and reuse the result: re.search(pattern, text)
    result = re.search(pattern_to_search, text)
    if result:
        print("Match was found: ")
    else:
        print("No Match was found: ")
    print(result)
    print("-" * 75)

Searching for term1 in: This is a string with term1, but it does not have the other term.
...
Match was found: 
<re.Match object; span=(22, 27), match='term1'>
---------------------------------------------------------------------------
Searching for term2 in: This is a string with term1, but it does not have the other term.
...
No Match was found: 
None
---------------------------------------------------------------------------


- The **`Match`** object returned by `search()` is more than a truthy value standing in for `None`
- It carries information about the match, including the original input string, the regular expression that was used, and the location of the match

In [2]:
import re
from re import Match
from typing import Optional

# Pattern to search for
ptrn: str = "term1"

# Text to parse
text = "This is a string with term1, but it does not have the other term."

match: Optional[Match[str]] = re.search(ptrn, text)
print(type(match))

<class 're.Match'>


In [3]:
help(match)

Help on Match in module re object:

class Match(builtins.object)
 |  The result of re.match() and re.search().
 |  Match objects always have a boolean value of True.
 |  
 |  Methods defined here:
 |  
 |  __copy__(self, /)
 |  
 |  __deepcopy__(self, memo, /)
 |  
 |  __getitem__(self, key, /)
 |      Return self[key].
 |  
 |  __repr__(self, /)
 |      Return repr(self).
 |  
 |  end(self, group=0, /)
 |      Return index of the end of the substring matched by group.
 |  
 |  expand(self, /, template)
 |      Return the string obtained by doing backslash substitution on the string template, as done by the sub() method.
 |  
 |  group(...)
 |      group([group1, ...]) -> str or tuple.
 |      Return subgroup(s) of the match by indices or names.
 |      For 0 returns the entire match.
 |  
 |  groupdict(self, /, default=None)
 |      Return a dictionary containing all the named subgroups of the match, keyed by the subgroup name.
 |      
 |      default
 |        Is used for groups tha

In [4]:
if match:
    # Show start of match
    print(match.start())

    # Show end
    print(match.end())

22
27
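
Beyond `start()` and `end()`, the other pieces of information mentioned above can be read from the same object. The following is a minimal sketch that reuses the example text; the variable names are illustrative only.

```python
import re

text = "This is a string with term1, but it does not have the other term."
match = re.search("term1", text)

if match is not None:
    matched_text = match.group()        # the substring that matched
    span = match.span()                 # (start, end) as a single tuple
    source_pattern = match.re.pattern   # the pattern string that was used
    source_text = match.string          # the original input string

    print(matched_text, span, source_pattern)
    # Slicing the input with the span recovers the matched text
    print(source_text[span[0]:span[1]])
```

Checking `match is not None` first matters because `re.search()` returns `None` when nothing is found, and `None` has none of these attributes.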


## <a id='toc2_'></a>Split With Regular Expressions [&#8593;](#toc0_)

- `re.split()` splits the string wherever the pattern matches and returns a list of the pieces
- The matched separators are removed from the result

In [5]:
# Term to split on
split_at_term: str = " "  # Each space
phrase: str = "What is the domain name of someone with the email: hello@gmail.com"

# Split the phrase
print(re.split(split_at_term, phrase))

['What', 'is', 'the', 'domain', 'name', 'of', 'someone', 'with', 'the', 'email:', 'hello@gmail.com']


- For a plain separator like this one, the result is the same as with `str.split()`

In [6]:
print(phrase.split(split_at_term))

['What', 'is', 'the', 'domain', 'name', 'of', 'someone', 'with', 'the', 'email:', 'hello@gmail.com']
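
Where `re.split()` goes beyond `str.split()` is that the separator is a pattern rather than a fixed string. As a small sketch, the made-up `record` string below is split on either a comma or a semicolon, with any surrounding whitespace absorbed into the separator. It uses a raw string and the `\s` escape code, both covered in the Escape Codes section.

```python
import re

# Hypothetical input with inconsistent separators
record = "apples, oranges;bananas ,  kiwis"

# Split on a comma or semicolon, including any whitespace around it
fields = re.split(r"\s*[,;]\s*", record)
print(fields)
```

This should print `['apples', 'oranges', 'bananas', 'kiwis']`, which a single `str.split()` call cannot produce because the separators differ.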


## <a id='toc3_'></a>Finding All Instances Of A Pattern [&#8593;](#toc0_)

- Use `re.findall()` to find all the instances of a pattern in a string

In [7]:
# Returns a list of all matches
cases: list[str] = re.findall("match", "test phrase match is in middle and match is close to the end.")
print(cases)

['match', 'match']


In [8]:
# Function to count the appearance of a specific text within a long string
def count_appearance_in_text(to_find: str, long_text: str) -> int:
    return len(re.findall(to_find, long_text))

to_find = "match"
text = "test phrase match is in middle and match is close to the end. And here are more match, match, match!! math!"
print("match", count_appearance_in_text(to_find, text))

match 5
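
One caveat with the counting function above: `to_find` is interpreted as a pattern, not as plain text. If the search term may contain characters with a special meaning, such as `.`, wrapping it in `re.escape()` makes it match literally. The helper name `count_literal_in_text` and the sample string are assumptions made for this sketch.

```python
import re


def count_literal_in_text(to_find: str, long_text: str) -> int:
    """Count non-overlapping occurrences of to_find, treated as plain text."""
    return len(re.findall(re.escape(to_find), long_text))


sample = "v1.0, v1.5 and v100"

# As a pattern, "." matches any character, so "100" also counts
as_pattern = len(re.findall("1.0", sample))

# Escaped, only the literal text "1.0" counts
as_literal = count_literal_in_text("1.0", sample)

print(as_pattern, as_literal)
```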


## <a id='toc4_'></a>Pattern `re` Syntax [&#8593;](#toc0_)

- We can use *metacharacters* in a pattern to find specific types of text

In [9]:
def multi_re_find(patterns: list[str], phrase: str) -> None:
    """Takes in a list of regex patterns and prints all matches of each.

    Args:
        `patterns`: A list of regex patterns.
        `phrase`: The string to search.

    Returns:
        - `None`
    """
    
    for pattern in patterns:
        print(f"Searching the phrase using the re check: {pattern}")
        print(re.findall(pattern, phrase))
        print("\n")

### <a id='toc4_1_'></a>Repetition Syntax [&#8593;](#toc0_)

- There are 7 ways to express repetition in a pattern

Meta-Character|Meaning|Example
:-|:-|:-
`*`|*Zero or more times*|`"d*"`
`+`|*One or more times*|`"d+"`
`?`|*Zero or one time*|`"d?"`
`{m}`|*Exactly `m` times*|`"d{4}"`
`{m,n}`|*At least `m` times but no more than `n` times*|`"d{2,5}"`
`{m,}`|*At least `m` times*|`"d{5,}"`
`{,n}`|*At most `n` times, including zero*|`"d{,3}"`

In [10]:
test_phrase: str = "sdsd..sssddd...sdddsddd...dsdds...dsssss...sddddddd"
test_patterns: list[str] = [
    "sd*",      # s followed by zero or more d's: [0, ...]
    "sd+",      # s followed by one or more d's: [1, ...]
    "sd?",      # s followed by zero or one d: [0, 1]
    "sd{3}",    # s followed by exactly three d's: [3]
    "sd{2,3}",  # s followed by two to three d's: [2, 3]
    "sd{3,}",   # s followed by three or more d's: [3, ...]
    "sd{,3}",   # s followed by at most three d's: [0, 3]
]

multi_re_find(test_patterns, test_phrase)

Searching the phrase using the re check: sd*
['sd', 'sd', 's', 's', 'sddd', 'sddd', 'sddd', 'sdd', 's', 's', 's', 's', 's', 's', 'sddddddd']


Searching the phrase using the re check: sd+
['sd', 'sd', 'sddd', 'sddd', 'sddd', 'sdd', 'sddddddd']


Searching the phrase using the re check: sd?
['sd', 'sd', 's', 's', 'sd', 'sd', 'sd', 'sd', 's', 's', 's', 's', 's', 's', 'sd']


Searching the phrase using the re check: sd{3}
['sddd', 'sddd', 'sddd', 'sddd']


Searching the phrase using the re check: sd{2,3}
['sddd', 'sddd', 'sddd', 'sdd', 'sddd']


Searching the phrase using the re check: sd{3,}
['sddd', 'sddd', 'sddd', 'sddddddd']


Searching the phrase using the re check: sd{,3}
['sd', 'sd', 's', 's', 'sddd', 'sddd', 'sddd', 'sdd', 's', 's', 's', 's', 's', 's', 'sddd']
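
Note that `re.findall()` reports pieces found inside a longer run, which is why `sd{3}` returns `'sddd'` for the final `sddddddd`. To check that an entire string respects the repetition bounds, one option is `re.fullmatch()`. The sketch below uses a made-up list of candidates.

```python
import re

pattern = "sd{2,3}"  # s followed by two or three d's
candidates = ["s", "sd", "sdd", "sddd", "sdddd"]

# fullmatch() succeeds only if the whole string matches the pattern
accepted = [c for c in candidates if re.fullmatch(pattern, c)]
print(accepted)
```

Only `'sdd'` and `'sddd'` should be accepted, since the other candidates have too few or too many `d`'s.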




### <a id='toc4_2_'></a>Character Sets [&#8593;](#toc0_)

- Used when you wish to match any one of a group of characters at a point in the input
- Square brackets (`[]`) are used to define the character set

In [11]:
test_phrase = "sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd"
test_patterns = [
    "[sd]",     # either s or d
    "s[sd]+"    # s followed by one or more s or d
]

multi_re_find(test_patterns, test_phrase)

Searching the phrase using the re check: [sd]
['s', 'd', 's', 'd', 's', 's', 's', 'd', 'd', 'd', 's', 'd', 'd', 'd', 's', 'd', 'd', 'd', 'd', 's', 'd', 's', 'd', 's', 's', 's', 's', 's', 's', 'd', 'd', 'd', 'd']


Searching the phrase using the re check: s[sd]+
['sdsd', 'sssddd', 'sdddsddd', 'sds', 'sssss', 'sdddd']
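
A character set is also handy for matching spelling variants. As a short sketch with a made-up sentence, the set `[ae]` accepts exactly one of the two letters at that position:

```python
import re

# [ae] matches a single character that is either "a" or "e"
spellings = re.findall("gr[ae]y", "grey, gray, groy and GRAY")
print(spellings)
```

This should print `['grey', 'gray']`. `groy` is rejected because `o` is not in the set, and `GRAY` is rejected because matching is case-sensitive by default.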




### <a id='toc4_3_'></a>Exclusion [&#8593;](#toc0_)

- Use `^` as the first character inside the brackets to match any character that is *not* in the set
- For example, `[^!.? ]` matches any character that is not `!`, `.`, `?`, or a space
- Add `+` to match one or more such characters in a row
  - This effectively extracts all the words
In [12]:
test_phrase = "This is a string! But it has punctuation. How can we remove them?"
test_patterns = ["[^!.? ]+"]
multi_re_find(test_patterns, test_phrase)

Searching the phrase using the re check: [^!.? ]+
['This', 'is', 'a', 'string', 'But', 'it', 'has', 'punctuation', 'How', 'can', 'we', 'remove', 'them']
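
To answer the question posed in the test phrase, the extracted words can be joined back together. This is one possible sketch, not the only way to strip punctuation:

```python
import re

test_phrase = "This is a string! But it has punctuation. How can we remove them?"

# Keep every run of characters that is not "!", ".", "?", or a space
words = re.findall("[^!.? ]+", test_phrase)

# Rejoin the words with single spaces
cleaned = " ".join(words)
print(cleaned)
```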




### <a id='toc4_4_'></a>Character Ranges [&#8593;](#toc0_)

- As character sets grow larger, typing every character that should (or should not) match becomes tedious
- Character ranges offer a more compact format: a range includes all the contiguous characters between a start and an end point
- The format is `[start-end]`
- A common use case is searching for a specific range of letters in the alphabet

In [13]:
test_phrase = "ThiS is an exAmple senTence. Lets sEe if we can fInd soMe letters."
test_patterns = [
    "[a-z]+",  # sequences of lower case letters
    "[A-Z]+",  # sequences of upper case letters
    "[a-zA-Z]+",  # sequences of lower or upper case letters
    "[A-Z][a-z]+" # one upper case letter followed by lower case letters
]

multi_re_find(test_patterns, test_phrase)

Searching the phrase using the re check: [a-z]+
['hi', 'is', 'an', 'ex', 'mple', 'sen', 'ence', 'ets', 's', 'e', 'if', 'we', 'can', 'f', 'nd', 'so', 'e', 'letters']


Searching the phrase using the re check: [A-Z]+
['T', 'S', 'A', 'T', 'L', 'E', 'I', 'M']


Searching the phrase using the re check: [a-zA-Z]+
['ThiS', 'is', 'an', 'exAmple', 'senTence', 'Lets', 'sEe', 'if', 'we', 'can', 'fInd', 'soMe', 'letters']


Searching the phrase using the re check: [A-Z][a-z]+
['Thi', 'Ample', 'Tence', 'Lets', 'Ee', 'Ind', 'Me']
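
Ranges are not limited to letters, and several ranges can share one set. As an illustrative sketch, the pattern below combines a digit range with two letter ranges and a repetition count to find six-digit hexadecimal color codes. The sample string is made up for this example.

```python
import re

sample = "Colors: #ff8800, #00FF7f and #xyz123"

# "#" followed by exactly six characters from 0-9, a-f, or A-F
hex_colors = re.findall("#[0-9a-fA-F]{6}", sample)
print(hex_colors)
```

This should print `['#ff8800', '#00FF7f']`. `#xyz123` is skipped because `x`, `y`, and `z` fall outside the ranges.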




### <a id='toc4_5_'></a>Escape Codes [&#8593;](#toc0_)

- Use special escape codes to find specific types of patterns in data, such as digits, non-digits, whitespace, and more
- They are written by prefixing a character with a backslash (`\`)
- A backslash must itself be escaped in normal Python strings, which makes expressions difficult to read
- Raw strings, created by prefixing the literal with `r`, avoid this problem and keep patterns readable

Code|Meaning
:-|:-
`\d`|A digit
`\D`|A non-digit
`\s`|Whitespace (tab, space, newline...)
`\S`|A non-whitespace
`\w`|Alphanumeric or underscore
`\W`|Neither alphanumeric nor underscore

In [14]:
test_phrase = "This is a string with some numbers 1233 and a symbol #hashtag"
test_patterns = [
    r"\d+",  # sequence of digits
    r"\D+",  # sequence of non-digits
    r"\s+",  # sequence of whitespace
    r"\S+",  # sequence of non-whitespace
    r"\w+",  # alphanumeric characters
    r"\W+",  # non-alphanumeric
]

multi_re_find(test_patterns, test_phrase)

Searching the phrase using the re check: \d+
['1233']


Searching the phrase using the re check: \D+
['This is a string with some numbers ', ' and a symbol #hashtag']


Searching the phrase using the re check: \s+
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']


Searching the phrase using the re check: \S+
['This', 'is', 'a', 'string', 'with', 'some', 'numbers', '1233', 'and', 'a', 'symbol', '#hashtag']


Searching the phrase using the re check: \w+
['This', 'is', 'a', 'string', 'with', 'some', 'numbers', '1233', 'and', 'a', 'symbol', 'hashtag']


Searching the phrase using the re check: \W+
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' #']
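
Two details from this section can be checked directly. The sketch below first compares a normal string with an escaped backslash against the equivalent raw string, then shows that `\w` also matches the underscore. The sample text is made up for illustration.

```python
import re

# A normal string needs a doubled backslash; the raw string does not
same_pattern = "\\d+" == r"\d+"
print(same_pattern)

# \w matches letters, digits, and the underscore, but not the hyphen
word_chars = re.findall(r"\w+", "snake_case and kebab-case")
print(word_chars)
```

The comparison should print `True`, and the second call should return `['snake_case', 'and', 'kebab', 'case']`: the underscore stays inside a word while the hyphen splits one.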


