# Regular Expressions

<a href="https://docs.python.org/3/library/re.html?highlight=re#module-re">[RE Documentation]</a>

Regular expressions are `text matching patterns` described with a formal syntax.

In [1]:
import re

In [2]:
patterns = ['term1', 'term2']

In [11]:
text = "This is a string with term1, but not the other term"

### Methods from `re`

```python
    re.search()
```

In [6]:
print(re.search('hello', 'hello world'))

<re.Match object; span=(0, 5), match='hello'>


In [13]:
for pattern in patterns:
    print(f"Searching for {pattern} in: \n\"{text}\"")
    
    if re.search(pattern, text):
        print('\n')
        print("Match found!\n")
    else:
        print('\n')
        print("Match not found!\n")

Searching for term1 in: 
"This is a string with term1, but not the other term"


Match found!

Searching for term2 in: 
"This is a string with term1, but not the other term"


Match not found!



In [14]:
match = re.search(patterns[0], text)
type(match)

re.Match

In [15]:
# start index of the match
match.start()

22

In [16]:
# end index of the match (exclusive)
match.end()

27
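As a small sketch beyond the cells above, a match object also offers `span()` and `group()`, and `re.search` returns `None` when nothing matches. The names `sample`, `found`, and `missing` are illustrative only and are not part of the original notebook.

```python
import re

sample = "This is a string with term1, but not the other term"

found = re.search('term1', sample)
print(found.span())     # (22, 27): start and end as one tuple
print(found.group())    # 'term1': the text that matched

missing = re.search('term2', sample)
print(missing is None)  # True: no match gives None, not a match object
```

Because a failed search returns `None`, it is safer to test the result before calling `start()` or `end()` on it.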

In [17]:
split_term = '@'

phrase = "What is your email, is it mail@gmail.com?"

In [18]:
re.split(split_term, phrase)

['What is your email, is it mail', 'gmail.com?']

In [19]:
# similar to the built-in str.split(), which splits on whitespace by default
'hello world'.split()

['hello', 'world']
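Unlike `str.split()`, `re.split()` accepts a pattern, so one call can split on several delimiters. The sketch below uses a character set and the `\s` escape code, both covered later in this notebook. The sample string `line` is made up for illustration.

```python
import re

line = "apples, oranges;pears  plums"

# Split on runs of commas, semicolons, or whitespace
parts = re.split(r'[,;\s]+', line)
print(parts)  # ['apples', 'oranges', 'pears', 'plums']
```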

```python
    re.findall()
```

In [20]:
re.findall('match', "Here's a match, here's another match, and another one!")

['match', 'match']
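`re.findall()` returns only the matched text. When positions are needed as well, `re.finditer()` is one option: it yields a match object per occurrence. A minimal sketch, reusing the sentence above under the illustrative name `sentence`:

```python
import re

sentence = "Here's a match, here's another match, and another one!"

# finditer yields match objects, so each match's span is available
for m in re.finditer('match', sentence):
    print(m.group(), m.span())

# Collect just the start index of every occurrence
positions = [m.start() for m in re.finditer('match', sentence)]
print(positions)  # [9, 31]
```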

### Pattern Matching

In [28]:
def multi_re_find(patterns, phrase):
    '''
        Takes a list of regex patterns and a phrase to search
        Prints a list of all matches for each pattern
    '''
    
    for pattern in patterns:
        print(f"Searching the phrase using re check: {pattern}")
        print(re.findall(pattern, phrase))
        print('\n')

### Repetition Syntax

There are five ways to express repetition in a pattern:

   1. A pattern followed by the meta-character <code>\*</code> is repeated **zero or more times**. 
   2. Replace the <code>\*</code> with <code>+</code> and the pattern must appear **at least once**. 
   3. Using <code>?</code> means the pattern appears **zero or one time**. 
   4. For a specific number of occurrences, use <code>{m}</code> after the pattern, where **m** is replaced with the number of times the pattern should repeat. 
   5. Use <code>{m,n}</code> where **m** is the minimum number of repetitions and **n** is the maximum. Leaving out **n**, as in <code>{m,}</code>, means the pattern appears at least **m** times, with no maximum.

In [29]:
test_phrase = 'sdsd..sssddd...sddsdd...dsds...dssss...sdddd'

test_patterns = [ 'sd*', # 's' followed by zero or more 'd's
                'sd+', # 's' followed by one or more 'd's
                'sd?', # 's' followed by zero or one 'd's
                'sd{3}', # 's' followed by 3 'd's
                'sd{2,3}' # 's' followed by 2 or 3 'd's
                ]

multi_re_find(test_patterns, test_phrase)

Searching the phrase using re check: sd*
['sd', 'sd', 's', 's', 'sddd', 'sdd', 'sdd', 'sd', 's', 's', 's', 's', 's', 'sdddd']


Searching the phrase using re check: sd+
['sd', 'sd', 'sddd', 'sdd', 'sdd', 'sd', 'sdddd']


Searching the phrase using re check: sd?
['sd', 'sd', 's', 's', 'sd', 'sd', 'sd', 'sd', 's', 's', 's', 's', 's', 'sd']


Searching the phrase using re check: sd{3}
['sddd', 'sddd']


Searching the phrase using re check: sd{2,3}
['sddd', 'sdd', 'sdd', 'sddd']
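The open-ended form `{m,}` is not among the test patterns above. A short sketch on the same test phrase, with `at_least_two` as an illustrative name:

```python
import re

test_phrase = 'sdsd..sssddd...sddsdd...dsds...dssss...sdddd'

# 's' followed by at least two 'd's, with no upper bound
at_least_two = re.findall('sd{2,}', test_phrase)
print(at_least_two)  # ['sddd', 'sdd', 'sdd', 'sdddd']
```

Compared with `sd{2,3}`, the last match keeps all four `d` characters because nothing caps the repetition.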




## Character Sets

Character sets are used when you wish to match any one of a group of characters at a point in the input. A character set is written inside square brackets. For example, the pattern <code>[ab]</code> matches a single occurrence of either **a** or **b**.

Let's see some examples:

In [30]:
test_phrase = 'sdsd..sssddd...sddsdd...dsds...dssss...sdddd'

test_patterns = ['[sd]', # either 's' or 'd'
                's[sd]+' # 's' followed by one or more 's' or 'd'
                ]

multi_re_find(test_patterns, test_phrase)

Searching the phrase using re check: [sd]
['s', 'd', 's', 'd', 's', 's', 's', 'd', 'd', 'd', 's', 'd', 'd', 's', 'd', 'd', 'd', 's', 'd', 's', 'd', 's', 's', 's', 's', 's', 'd', 'd', 'd', 'd']


Searching the phrase using re check: s[sd]+
['sdsd', 'sssddd', 'sddsdd', 'sds', 'ssss', 'sdddd']
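A character set can also sit in the middle of a longer pattern to allow for spelling variants. A minimal sketch, with a made-up sample sentence `word_line`:

```python
import re

word_line = "The gray cat and the grey dog"

# 'gr', then either 'a' or 'e', then 'y'
variants = re.findall('gr[ae]y', word_line)
print(variants)  # ['gray', 'grey']
```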




## Exclusion

We can use <code>^</code> to exclude characters by placing it first inside the brackets. For example: <code>[^...]</code> will match any single character not in the brackets.

In [31]:
test_phrase = "This is a string! But it has punctuation. How can we remove it?"

In [32]:
re.findall('[^!.? ]+', test_phrase)

['This',
 'is',
 'a',
 'string',
 'But',
 'it',
 'has',
 'punctuation',
 'How',
 'can',
 'we',
 'remove',
 'it']
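The test phrase asks how to remove the punctuation. One possible approach, sketched here, is `re.sub()`, which replaces every match of a pattern with a given string. Rejoining the words found by the exclusion pattern gives the same result for this phrase. The names `no_punct` and `words` are illustrative.

```python
import re

test_phrase = "This is a string! But it has punctuation. How can we remove it?"

# Replace every '!', '.', or '?' with an empty string
no_punct = re.sub('[!.?]', '', test_phrase)
print(no_punct)

# Alternative: rejoin the words found by the exclusion pattern
words = re.findall('[^!.? ]+', test_phrase)
print(' '.join(words))
```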

## Character Ranges

As character sets grow larger, typing every character that should (or should not) match could become very tedious. A more compact format using character ranges lets you define a character set to include all of the contiguous characters between a start and stop point. The format used is <code>[start-end]</code>.

A common use case is searching for a specific range of letters in the alphabet. For instance, <code>[a-f]</code> matches any single lowercase letter from a through f.

In [34]:
test_phrase = "This is an example sentence. Let's see if we can find some letters."

test_patterns = ['[a-z]+', # sequences of lowercase letters
                '[A-Z]+', # sequences of uppercase letters
                '[a-zA-Z]+', # sequences of lowercase or uppercase letters
                '[A-Za-z]+' # sequences of uppercase or lowercase letters
                ]

multi_re_find(test_patterns, test_phrase)

Searching the phrase using re check: [a-z]+
['his', 'is', 'an', 'example', 'sentence', 'et', 's', 'see', 'if', 'we', 'can', 'find', 'some', 'letters']


Searching the phrase using re check: [A-Z]+
['T', 'L']


Searching the phrase using re check: [a-zA-Z]+
['This', 'is', 'an', 'example', 'sentence', 'Let', 's', 'see', 'if', 'we', 'can', 'find', 'some', 'letters']


Searching the phrase using re check: [A-Za-z]+
['This', 'is', 'an', 'example', 'sentence', 'Let', 's', 'see', 'if', 'we', 'can', 'find', 'some', 'letters']
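Ranges are not limited to letters, and several ranges can share one set. As a sketch, the patterns below pick out hexadecimal color codes and runs of decimal digits from a made-up string `color_line`:

```python
import re

color_line = "Background #1a2b3c, border #FFF, text 12px"

# '#' followed by one or more hexadecimal digits in either case
hex_codes = re.findall('#[0-9a-fA-F]+', color_line)
print(hex_codes)  # ['#1a2b3c', '#FFF']

# Runs of decimal digits only
numbers = re.findall('[0-9]+', color_line)
print(numbers)    # ['1', '2', '3', '12']
```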




## Escape Codes

You can use special escape codes to find specific types of patterns in your data, such as digits, non-digits, whitespace, and more. For example:

<table border="1" class="docutils">
<colgroup>
<col width="14%" />
<col width="86%" />
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Code</th>
<th class="head">Meaning</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td><tt class="docutils literal"><span class="pre">\d</span></tt></td>
<td>a digit</td>
</tr>
<tr class="row-odd"><td><tt class="docutils literal"><span class="pre">\D</span></tt></td>
<td>a non-digit</td>
</tr>
<tr class="row-even"><td><tt class="docutils literal"><span class="pre">\s</span></tt></td>
<td>whitespace (tab, space, newline, etc.)</td>
</tr>
<tr class="row-odd"><td><tt class="docutils literal"><span class="pre">\S</span></tt></td>
<td>non-whitespace</td>
</tr>
<tr class="row-even"><td><tt class="docutils literal"><span class="pre">\w</span></tt></td>
<td>alphanumeric (letters, digits, and underscore)</td>
</tr>
<tr class="row-odd"><td><tt class="docutils literal"><span class="pre">\W</span></tt></td>
<td>non-alphanumeric (anything except letters, digits, and underscore)</td>
</tr>
</tbody>
</table>

Escapes are indicated by <b>prefixing</b> the character with a backslash <code>\</code>. Unfortunately, a backslash must itself be escaped in normal Python strings, which makes expressions difficult to read. Raw strings, created by prefixing the literal with <code>r</code>, avoid this problem and keep patterns readable.
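To make the difference concrete, the sketch below writes the same pattern once as a normal string and once as a raw string. The names `escaped` and `raw` and the sample text are illustrative.

```python
import re

escaped = '\\d+'  # normal string: the backslash must be doubled
raw = r'\d+'      # raw string: the backslash is written once

print(escaped == raw)  # True: both hold the same pattern
print(len(raw))        # 3: backslash, 'd', '+'

print(re.findall(raw, "room 101, floor 7"))  # ['101', '7']
```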

In [36]:
test_phrase = "This is a string with some numbers 1234 and a symbol #hastag"

test_patterns = [r'\d+', # sequence of digits
                r'\D+', # sequence of non-digits
                r'\s+', # sequence of whitespace
                r'\S+', # sequence of non-whitespace
                r'\w+', # sequence of alphanumeric
                r'\W+' # sequence of non-alphanumeric
                ]

multi_re_find(test_patterns, test_phrase)

Searching the phrase using re check: \d+
['1234']


Searching the phrase using re check: \D+
['This is a string with some numbers ', ' and a symbol #hastag']


Searching the phrase using re check: \s+
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']


Searching the phrase using re check: \S+
['This', 'is', 'a', 'string', 'with', 'some', 'numbers', '1234', 'and', 'a', 'symbol', '#hastag']


Searching the phrase using re check: \w+
['This', 'is', 'a', 'string', 'with', 'some', 'numbers', '1234', 'and', 'a', 'symbol', 'hastag']


Searching the phrase using re check: \W+
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' #']
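One detail worth checking in practice is that `\w` also matches the underscore, so identifiers written in snake_case stay in one piece while hyphenated words are split. A minimal sketch with a made-up string `identifier_line`:

```python
import re

identifier_line = "user_name = first-name + last_name2"

# \w matches letters, digits, and the underscore, but not '-'
tokens = re.findall(r'\w+', identifier_line)
print(tokens)  # ['user_name', 'first', 'name', 'last_name2']
```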




More `re` tutorials at <a href="http://www.tutorialspoint.com/python/python_reg_expressions.htm">[TutorialsPoint]</a>