## Regular Expression
- <b>Search:</b> Allows you to search for specific patterns of characters, such as phone numbers, email addresses, dates, etc.
- <b>Modify:</b> Can be used to extract data from text, or replace certain patterns with other text. For example, you can search for the word ‘Hello’ and replace it with ‘Hi’ or you can search for the phone number xx-xxxx-xxxx and replace it with the format xxx-xxx-xxxx.
- <b>Validate:</b> Validate user input. Developers can ensure data being processed meets certain criteria such as format, length, or type. For example, if a user is required to enter an email address, validation can ensure that the input contains an "@" symbol and a valid domain name.
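
The three uses above can be sketched with Python's built-in `re` module. This is a minimal illustration, not a production validator: the sample text, the helper name `looks_like_email`, and the deliberately loose email pattern are all assumptions made for the example.

```python
import re

text = "Call 555-123-4567 or say Hello to support."

# Search: find the first phone-like pattern (3 digits, 3 digits, 4 digits).
found = re.search(r"\d{3}-\d{3}-\d{4}", text)
phone = found.group() if found else None
print(phone)  # 555-123-4567

# Modify: replace the word 'Hello' with 'Hi'.
modified = re.sub(r"Hello", "Hi", text)
print(modified)  # Call 555-123-4567 or say Hi to support.

# Validate: a loose check for "something@domain.tld" (hypothetical helper).
def looks_like_email(value):
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[A-Za-z]{2,}", value) is not None

print(looks_like_email("user@example.com"))  # True
print(looks_like_email("user.example.com"))  # False
```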

## Common use cases
- Validating email addresses, URLs, and phone numbers.
- Web form validation.
- Parsing and extracting information from text.
- Search and replace operations in strings.
- Text data cleaning and validation.


In [1]:
import re

### Raw Strings
- A string literal prefixed with 'r' or 'R' is a raw string.
- In a normal string literal, Python translates escape sequences (such as \n, \t etc.) into the characters they stand for; in a raw string, the backslash is kept as a literal character. Raw strings are therefore the usual way to write regex patterns, which use many backslashes.

In [None]:
rawstr = r"Hello! How are you?"
print(rawstr)

str1 = "Hello!\nHow are you?"
print("normal string: ", str1) # '\n' inside str1 (normal string) is translated as a newline being printed on the next line

str2 = r"Hello!\nHow are you?"
print("raw string: ", str2) # '\n' is printed as \n

Hello! How are you?
normal string:  Hello!
How are you?
raw string:  Hello!\nHow are you?
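
Raw strings matter for regular expressions because some regex escapes, such as `\b`, are also Python string escapes. The following sketch (the sample text is made up for illustration) compares the same pattern written as a normal string and as a raw string.

```python
import re

text = "cat concat cat."

# In a normal string, "\b" is the backspace character (a single character),
# so the regex engine never sees the word-boundary escape.
normal_pattern = "\bcat\b"

# In a raw string, the backslash is kept, so the regex engine receives \b.
raw_pattern = r"\bcat\b"

print(len(normal_pattern), len(raw_pattern))  # 5 7
print(re.findall(normal_pattern, text))       # []
print(re.findall(raw_pattern, text))          # ['cat', 'cat']
```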


## RegEx Metacharacters
<table>
<tr>
    <th>Pattern</th>
    <th>Description</th>
</tr>

<tr>
    <td>[abc]</td>
    <td>match any of the characters a, b, or c</td>
</tr>

<tr>
    <td>[a-c]</td>
    <td>match any of the characters a, b, or c, written as a range</td>
</tr>

<tr>
    <td>[a-z]</td>
    <td>match only lowercase letters</td>
</tr>

<tr>
    <td>[0-9]</td>
    <td>match only digits</td>
</tr>

<tr>
    <td>.</td>
    <td>Matches any single character except a newline (unless the re.S flag is set)</td>
</tr>

<tr>
    <td>*</td>
    <td>Matches zero or more occurrences of the preceding element</td>
</tr>

<tr>
    <td>+</td>
    <td>Matches one or more occurrences of the preceding element</td>
</tr>

<tr>
    <td>?</td>
    <td>Matches zero or one occurrence of the preceding element</td>
</tr>

<tr>
    <td>^</td>
    <td>Matches the start of the string</td>
</tr>

<tr>
    <td>$</td>
    <td>Matches the end of the string</td>
</tr>

<tr>
    <td>|</td>
    <td>Serves as an OR operator</td>
</tr>

<tr>
    <td>[]</td>
    <td>Matches any single character within the brackets</td>
</tr>

<tr>
    <td>[^ ]</td>
    <td>Matches any single character NOT within the brackets</td>
</tr>

<tr>
    <td>()</td>
    <td>Groups regex operators together</td>
</tr>

<tr>
    <td>\d</td>
    <td>Matches any decimal digit; this is equivalent to the class [0-9]</td>
</tr>

<tr>
    <td>\D</td>
    <td>Matches any non-digit character</td>
</tr>

<tr>
    <td>\s</td>
    <td>Matches any whitespace character</td>
</tr>

<tr>
    <td>\S</td>
    <td>Matches any non-whitespace character</td>
</tr>

<tr>
    <td>\w</td>
    <td>Matches any word character: a letter, a digit, or the underscore (equivalent to [a-zA-Z0-9_] in ASCII mode)</td>
</tr>

<tr>
    <td>\W</td>
    <td>Matches any non-word character (anything \w does not match)</td>
</tr>

<tr>
    <td>\b</td>
    <td>Matches the boundary between a word character and a non-word character. \B is the opposite of \b</td>
</tr>

<tr>
    <td>\</td>
    <td>Escapes a metacharacter so that it is matched literally.
For example, use \. to match a period, \+ to match a plus sign, or \\ to match a backslash. If you are unsure whether a punctuation character such as '@' has a special meaning, you can put a backslash in front of it (\@). An unknown escape made of a backslash and an ASCII letter, such as \c, is not valid, and Python raises an error when the pattern is compiled
</td>
</tr>

<tr>
    <td>{n,m}</td>
    <td>Matches at least n and at most m occurrences of the preceding element (write it without a space after the comma)</td>
</tr>

<tr>
    <td>a|b</td>
    <td>Matches either a or b</td>
</tr>

<tr>
    <td>\t, \n, \r</td>
    <td>tab, newline, carriage return</td>
</tr>
</table>
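
A few of the metacharacters from the table can be tried out with `re.findall()` and `re.search()`. The sample strings below are invented for illustration; the expected results are shown in the comments.

```python
import re

in_range = re.findall(r"[a-c]", "abcdef")            # ['a', 'b', 'c']
not_in_range = re.findall(r"[^a-c]", "abcdef")       # ['d', 'e', 'f']
optional_u = re.findall(r"colou?r", "color colour")  # ['color', 'colour']
numbers = re.findall(r"\d+", "room 42, floor 7")     # ['42', '7']

# \w also matches the underscore, so snake_case stays in one piece.
words = re.findall(r"\w+", "snake_case and 2 words")  # ['snake_case', 'and', '2', 'words']

# The dot matches any character except a newline.
dot = re.findall(r"a.c", "abc a-c a\nc")              # ['abc', 'a-c']

either = re.findall(r"cat|dog", "cat dog bird")       # ['cat', 'dog']
periods = re.findall(r"\.", "a.b.c")                  # ['.', '.']

# ^ and $ anchor the pattern, so only strings of 2 or 3 digits pass.
candidates = ["1", "12", "123", "1234"]
two_or_three = [s for s in candidates if re.search(r"^\d{2,3}$", s)]  # ['12', '123']

print(in_range, not_in_range, optional_u, numbers)
print(words, dot, either, periods, two_or_three)
```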



## Examples
- 2|3: The | character is a “pipe” symbol and is used as an “or” operator, so 2|3 matches either 2 or 3.
- \d{3}-\d{3}-\d{4}:            Matches a US telephone number.
- [A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}:       Matches an email address.
- (http://|https://)?([A-Za-z0-9-]+\.[A-Za-z0-9-.]+)(/[A-Za-z0-9-._~:/?#@!$&'()*+,;=]*)?:   Matches a URL.
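
The phone and email patterns above can be checked against a sample sentence. This is a sketch only: the names `phone_re` and `email_re` and the sample text are assumptions for the example. Note that the patterns are not anchored, so `fullmatch()` is the safer choice when validating a whole input string.

```python
import re

phone_re = re.compile(r"\d{3}-\d{3}-\d{4}")
email_re = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

text = "Reach Ana at ana.lopez@example.org or 415-555-0132; fax 20-5550-0199."

phones = phone_re.findall(text)
emails = email_re.findall(text)
print(phones)  # ['415-555-0132']
print(emails)  # ['ana.lopez@example.org']

# The | operator: match either alternative.
twos_and_threes = re.findall(r"2|3", "1 2 3 4")
print(twos_and_threes)  # ['2', '3']

# For validation, require the whole string to match.
valid = phone_re.fullmatch("415-555-0132") is not None
too_long = phone_re.fullmatch("415-555-01329") is not None
print(valid, too_long)  # True False
```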

## Python RegEx Methods
<table>
<tr>
    <th>Method</th>
    <th>Description</th>
</tr>

<tr>
    <td>re.compile('pattern')</td>
    <td>Compile a regular expression pattern provided as a string into a re.Pattern object</td>
</tr>

<tr>
    <td>re.search(pattern, str)</td>
    <td>Search for occurrences of the regex pattern inside the target string and return only the first match</td>
</tr>

<tr>
    <td>re.match(pattern, str)</td>
    <td>Try to match the regex pattern at the start of the string. It returns a match only if the pattern is located at the beginning of the string</td>
</tr>

<tr>
    <td>re.fullmatch(pattern, str)</td>
    <td>Match the regular expression pattern to the entire string from the first to the last character</td>
</tr>

<tr>
    <td>re.findall(pattern, str)</td>
    <td>Scans the regex pattern through the entire string and returns all matches</td>
</tr>

<tr>
    <td>re.finditer(pattern, str)</td>
    <td>Scans the regex pattern through the entire string and returns an iterator yielding match objects</td>
</tr>

<tr>
    <td>re.split(pattern, str)</td>
    <td>Splits the string at each match of the regex pattern and returns the resulting list of substrings</td>
</tr>

<tr>
    <td>re.sub(pattern, replacement, str)</td>
    <td>Replace every non-overlapping occurrence of the pattern in the string with a replacement (use the count argument to limit the number of replacements)</td>
</tr>

<tr>
    <td>re.subn(pattern, replacement, str)</td>
    <td>Same as re.sub(). The difference is it will return a tuple of two elements.
First, a new string after all replacement, and second the number of replacements it has made
</td>
</tr>
</table>
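
The methods in the table can be compared side by side on one input. The date-like sample string below is an assumption made for illustration.

```python
import re

text = "2023-01-15, 2024-02-20"
date = r"\d{4}-\d{2}-\d{2}"

first = re.search(date, text).group()        # first match anywhere: '2023-01-15'
at_start = re.match(r"\d{4}", text).group()  # match at the start only: '2023'
not_at_start = re.match(r"2024", text)       # None, '2024' is not at the start
whole = re.fullmatch(date, text)             # None, the string has extra text
all_dates = re.findall(date, text)           # ['2023-01-15', '2024-02-20']

# finditer yields Match objects, which also carry positions.
starts = [m.start() for m in re.finditer(date, text)]  # [0, 12]

parts = re.split(r",\s*", text)              # ['2023-01-15', '2024-02-20']
slashed = re.sub(r"-", "/", text)            # '2023/01/15, 2024/02/20'
result, count = re.subn(r"-", "/", text)     # same string, plus count 4

print(first, at_start, not_at_start, whole)
print(all_dates, starts, parts)
print(slashed, count)
```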

## RegEx Flags
- Example: re.findall(pattern, string, flags=re.I|re.M|re.X) # use the | operator to combine them
<table>
<tr>
    <th>Flag</th>
    <th>Long Syntax</th>
    <th>Meaning</th>
</tr>

<tr>
    <td>re.A</td>
    <td>re.ASCII</td>
    <td>Perform ASCII-only matching instead of full Unicode matching</td>
</tr>

<tr>
    <td>re.I</td>
    <td>re.IGNORECASE</td>
    <td>Perform case-insensitive matching</td>
</tr>

<tr>
    <td>re.M</td>
    <td>re.MULTILINE</td>
    <td>Changes how the metacharacters ^ (caret) and $ (dollar) behave. With this flag, ^ matches at the beginning of the string and at the beginning of each line (immediately after each \n),
and $ matches at the end of the string and at the end of each line (immediately before each \n)</td>
</tr>

<tr>
    <td>re.S</td>
    <td>re.DOTALL</td>
    <td>Make the DOT (.) special character match any character at all, including a newline. Without this flag, DOT(.) will match anything except a newline</td>
</tr>

<tr>
    <td>re.X</td>
    <td>re.VERBOSE</td>
    <td>Allow whitespace and comments inside the pattern, which makes a long regex more readable</td>
</tr>
</table>
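
The effect of each flag is easiest to see by running the same pattern with and without it. The multi-line sample text below is invented for this sketch.

```python
import re

text = "first line\nSecond line\nTHIRD line"

# re.I: ignore case.
ignore_case = re.findall(r"second|third", text, flags=re.I)  # ['Second', 'THIRD']

# re.M: ^ matches at the start of every line, not only the start of the string.
without_m = re.findall(r"^\w+", text)                        # ['first']
with_m = re.findall(r"^\w+", text, flags=re.M)               # ['first', 'Second', 'THIRD']

# re.S: the dot also matches newlines.
without_s = re.search(r"first.*THIRD", text)                 # None
with_s = re.search(r"first.*THIRD", text, flags=re.S)        # a Match object

# re.A: \w matches ASCII characters only.
unicode_words = re.findall(r"\w+", "café")                   # ['café']
ascii_words = re.findall(r"\w+", "café", flags=re.A)         # ['caf']

# re.X: whitespace and comments inside the pattern are ignored.
verbose = re.compile(r"""
    ^(\w+)   # first word on a line
    \s+      # whitespace
    line$    # the word 'line' at the end of the line
""", flags=re.X | re.M)
first_words = verbose.findall(text)                          # ['first', 'Second', 'THIRD']

print(ignore_case, without_m, with_m)
print(without_s, with_s is not None, unicode_words, ascii_words, first_words)
```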

## Match Object Methods
- When a regex pattern matches, Python returns a Match object (re.Match); when nothing matches, functions such as re.search() return None. Use the Match object's methods to extract matched values and positions.
<table>
<tr>
    <th>Method</th>
    <th>Meaning</th>
</tr>

<tr>
    <td>group()</td>
    <td>Return the string matched by the regex pattern</td>
</tr>

<tr>
    <td>groups()</td>
    <td>Returns a tuple containing the strings for all matched subgroups</td>
</tr>

<tr>
    <td>start()</td>
    <td>Return the start position of the match</td>
</tr>

<tr>
    <td>end()</td>
    <td>Return the end position of the match</td>
</tr>

<tr>
    <td>span()</td>
    <td>Return a tuple containing the (start, end) positions of the match</td>
</tr>
</table>
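
A short sketch of the Match object methods, using a made-up input string and a simplified pattern with two capture groups. Because `re.search()` returns None when nothing matches, the result is checked before its methods are called.

```python
import re

m = re.search(r"(\w+)@(\w+)\.com", "contact: drake@hotmail.com")

if m is not None:
    print(m.group())   # whole match: drake@hotmail.com
    print(m.group(1))  # first subgroup: drake
    print(m.groups())  # all subgroups: ('drake', 'hotmail')
    print(m.start())   # 9
    print(m.end())     # 26
    print(m.span())    # (9, 26)
```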

## Compile RegEx Pattern
- Python’s re.compile() method compiles a regular expression pattern provided as a string into a regex pattern object (re.Pattern). We can then use this pattern object to search for a match inside different target strings by calling its methods, such as pattern.match() or pattern.search().

- In simple terms, we can compile a regular expression into a regex object to look for occurrences of the same pattern inside various target strings without rewriting it.

In [4]:
# re.compile(pattern, flags=0)
pattern = re.compile(r"\b\w{5}\b") # Look for a 5-letter word
res = pattern.findall("Jessa and Kelly")
print(pattern)
print(res)

re.compile('\\b\\w{5}\\b')
['Jessa', 'Kelly']


In [5]:
target_str = "My roll number is 25"

# created a regex pattern r'\d' to match any digit between 0 to 9
# used the re.findall() method to match our pattern
res = re.findall(r"\d", target_str) 

# extract matching value
print(res) # Output ['2', '5']

['2', '5']


## Capture email address
- In the following example, we try to capture an email address, starting from 'From:' and extending to the '@' character. Because .+ is greedy, the match runs to the last '@' in the string rather than the first.

In [6]:
stri = 'From: stephen.a.smith@espn.com, drake@hotmail.com, frenchMontana@gmail.com'
stri = stri.rstrip()
print(re.findall('From:.+@', stri))

['From: stephen.a.smith@espn.com, drake@hotmail.com, frenchMontana@']
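
The output above stops at the last '@' because `.+` is greedy. As a sketch of two alternatives, the non-greedy `.+?` stops at the first '@', and a dedicated address pattern pulls out each address in full. The address pattern used here is a simplified assumption, not a complete email grammar.

```python
import re

stri = 'From: stephen.a.smith@espn.com, drake@hotmail.com, frenchMontana@gmail.com'

# Greedy: .+ takes as much as possible, so the match ends at the last '@'.
greedy = re.findall(r'From:.+@', stri)
print(greedy)  # ['From: stephen.a.smith@espn.com, drake@hotmail.com, frenchMontana@']

# Non-greedy: .+? takes as little as possible, so the match ends at the first '@'.
lazy = re.findall(r'From:.+?@', stri)
print(lazy)  # ['From: stephen.a.smith@']

# Simplified address pattern: local part, '@', then dot-separated domain labels.
addresses = re.findall(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+', stri)
print(addresses)  # ['stephen.a.smith@espn.com', 'drake@hotmail.com', 'frenchMontana@gmail.com']
```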
