# Regex

In [2]:
import re

Anchors

- line anchors
  - `^`: matches at the start of the string; with `re.M`, also at the start of each line (right after a line break)
  - `$`: matches at the end of the string (or just before a trailing line break); with `re.M`, also at the end of each line
- string anchors
  - `\A`: matches only at the start of the whole string, even with `re.M` (useful for multi-line strings)
  - `\Z`: matches only at the end of the whole string, even with `re.M` (useful for multi-line strings)

In [25]:
s = 'hi hello\ntop spot' #\n: line break

# re.M: flag to treat input as multiline string
print(bool(re.search(r'hello$', s, flags=re.M))) 
print(bool(re.search(r'hello\Z', s)))

True
False
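
To make the line vs. string anchor distinction concrete, here is a small sketch on the same string `s`. The variable names (`line_starts`, `string_start`, `string_end`) are only illustrative.

```python
import re

s = 'hi hello\ntop spot'

# With re.M, ^ matches at the start of every line
line_starts = re.findall(r'^\w+', s, flags=re.M)

# \A matches only at the very start of the string, even with re.M
string_start = re.findall(r'\A\w+', s, flags=re.M)

# \Z matches only at the very end of the string, even with re.M
string_end = re.findall(r'\w+\Z', s, flags=re.M)

print(line_starts)   # ['hi', 'top']
print(string_start)  # ['hi']
print(string_end)    # ['spot']
```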


`[]`: a set of characters; matches any single character listed inside
- `[,.]`: matches one of the specified characters (`,` or `.`)
- `[^abc]`: matches any character EXCEPT `a`, `b`, `c`
  - `[^...]`: a leading `^` inverts the selection
- `[0-5][0-9]`: matches any two-digit number from 00 to 59
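
A quick sketch of the three character-class forms above. The sample strings and variable names are made up for illustration.

```python
import re

# [,.] matches a single comma or period
separators = re.findall(r'[,.]', '1,000,000.001')

# [^abc] matches any single character other than a, b, or c
not_abc = re.findall(r'[^abc]', 'aXbYc1')

# [0-5][0-9] matches two-digit values from 00 to 59
minutes = re.findall(r'[0-5][0-9]', '07 42 59 75')

print(separators)  # [',', ',', '.']
print(not_abc)     # ['X', 'Y', '1']
print(minutes)     # ['07', '42', '59']
```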

`\`: starts a special sequence (or escapes a metacharacter)
- `\w`: matches a word character (`a-z`, `A-Z`, `0-9`, `_`)
- `\W`: matches a non-word character
- `\d` = `[0-9]`
- `\D` = `[^0-9]` or `[^\d]`
- `\s`: match whitespace character
- `\S`: match non-whitespace character

`.`: any single character except a line break
- to match a literal dot `.`, add a backslash before it: `\.`

`*`: zero or more

`+`: one or more

`?`: zero or one
- `re.compile(r"https?://(www\.)?([a-zA-Z]+)\.([a-zA-Z]+)")`
  - `s?`: `s` is an optional character
  - `(www\.)?`: `www.` is an optional group
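
The sketch below exercises `.`, `\.`, `+`, and `?` on made-up inputs. The URL pattern is the one from the note above, and `example.com` is only a placeholder domain.

```python
import re

# . matches any single character except a line break
any_char = re.findall(r'h.t', 'hat hot h\nt h.t')

# \. matches only a literal dot
literal_dot = re.findall(r'h\.t', 'hat hot h.t')

# + means one or more of the preceding item
runs_of_a = re.findall(r'a+', 'caaandy bar')

# ? makes the preceding character or group optional
url = re.compile(r'https?://(www\.)?([a-zA-Z]+)\.([a-zA-Z]+)')
plain = url.match('http://example.com').groups()
with_www = url.match('https://www.example.com').groups()

print(any_char)     # ['hat', 'hot', 'h.t']
print(literal_dot)  # ['h.t']
print(runs_of_a)    # ['aaa', 'a']
print(plain)        # (None, 'example', 'com')
print(with_www)     # ('www.', 'example', 'com')
```

An optional group that did not take part in the match is reported as `None`, which is why the first URL has `None` in the `www.` slot.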

In [72]:
s = '*&w^esss%%$#3'
print(re.findall(r'\w+', s))
print(re.findall(r'\W+', s))

['w', 'esss', '3']
['*&', '^', '%%$#']


In [75]:
s = '\t\n\r\f\v ssssdeeeesw'
print(re.findall(r'\s', s))
print(re.findall(r'\S', s))

['\t', '\n', '\r', '\x0c', '\x0b', ' ']
['s', 's', 's', 's', 'd', 'e', 'e', 'e', 'e', 's', 'w']
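
The cells above cover `\w`/`\W` and `\s`/`\S`; for completeness, a similar sketch for `\d` and `\D` on a made-up string:

```python
import re

s = 'tel: 555-0199'

digits = re.findall(r'\d+', s)      # runs of digits
non_digits = re.findall(r'\D+', s)  # runs of everything else

print(digits)      # ['555', '0199']
print(non_digits)  # ['tel: ', '-']
```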


`re.split(pattern, string)`: splits the string at each occurrence of the pattern

In [3]:
re.split(r'-', '+1-001-0111-1001')

['+1', '001', '0111', '1001']

In [4]:
s = '+1-001-0111-1001'
s.split('-')

['+1', '001', '0111', '1001']

In [26]:
# split on every occurrence of `.` or `,`
# (inside [], `|` is a literal character, so it is not needed for alternation)
re.split(r'[.,]', '1,000,000.001')

['1', '000', '000', '001']
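
Two further behaviours of `re.split` that may be handy: a capturing group in the pattern keeps the delimiters in the result, and `maxsplit` limits the number of splits. This is a small sketch on made-up inputs.

```python
import re

# A capturing group around the delimiter keeps it in the output
with_delims = re.split(r'([.,])', '1,000.5')

# maxsplit=1 stops after the first split
first_only = re.split(r'-', '+1-001-0111-1001', maxsplit=1)

print(with_delims)  # ['1', ',', '000', '.', '5']
print(first_only)   # ['+1', '001-0111-1001']
```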

re group

- `()`: matches the expression inside the parentheses and captures it as a group
  - `(?:...)` groups without capturing: `a(?:b|c)d` matches the same strings as `abd|acd`
- `(?...)`: `?` right after `(` is an extension notation; its meaning depends on the character immediately to its right.
- `(?P<name>...)`: named capture group
  - retrieve a named group's value with `re.match(pattern, string).groupdict()[name]`
- `{m,n}`: matches the expression to its left at least m and at most n times (inclusive).
  - `{,n}`: from 0 to n times
  - `{n}`: exactly n times
- `(?<=B)A`: positive lookbehind assertion. This matches the expression A only if B is immediately to its left; B is not part of the match.
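
The groups themselves are demonstrated in the cells that follow; the sketch here covers the `{m,n}` quantifiers and the lookbehind on made-up inputs.

```python
import re

# {2,3}: two or three digits, as many as possible
two_to_three = re.findall(r'\d{2,3}', '1 22 333 4444')

# {,2}: zero to two 'b' characters between 'a' and 'c'
up_to_two = re.findall(r'ab{,2}c', 'ac abc abbc abbbc')

# {4}: exactly four digits (word boundaries exclude longer runs)
exactly_four = re.findall(r'\b\d{4}\b', '12 1234 12345')

# (?<=\$)\d+: digits only when preceded by '$'; the '$' is not returned
after_dollar = re.findall(r'(?<=\$)\d+', 'cost $42, qty 7')

print(two_to_three)  # ['22', '333', '444']
print(up_to_two)     # ['ac', 'abc', 'abbc']
print(exactly_four)  # ['1234']
print(after_dollar)  # ['42']
```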

In [64]:
pattern = r'\bpar(en|ro)?t\b'
s = 'parrot'
re.match(pattern, s)

<re.Match object; span=(0, 6), match='parrot'>

In [49]:
re.findall(r'hand(?:y|ful)e', '123handed42handye777handfule500')

['handye', 'handfule']

In [65]:
m = re.match(r'(?P<first>\w+) (?P<last>\w+)', 'Jane Doe')
m.groupdict()

{'first': 'Jane', 'last': 'Doe'}

In [48]:
# find all runs of 2 or more consecutive vowels
s = 'rabcdeefgyYhFjkIoomnpOeorteeeeet'

vowels = set('aeiouAEIOU')
res = []
i = 0
while i < len(s):
    if s[i] in vowels:
        j = i + 1
        # the bounds check avoids an IndexError when s ends with a vowel
        while j < len(s) and s[j] in vowels:
            j += 1
        if j - i >= 2:
            res.append(s[i:j])
        i = j
    else:
        i += 1
print(res)

['ee', 'Ioo', 'Oeo', 'eeeee']


In [47]:
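# use regex: runs of 2 or more vowels with a consonant on each side
# (stricter than the loop above, which does not check the neighbours)
# https://www.hackerrank.com/challenges/re-findall-re-finditer/forum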

vowel = 'aeiou'
consonant = "qwrtypsdfghjklzxcvbnm"
res = re.findall(r'(?<=[%s])([%s]{2,})[%s]' % (consonant, vowel, consonant), s, flags=re.I)
res


['ee', 'Ioo', 'Oeo', 'eeeee']

## Reference

- https://www.w3schools.com/python/python_regex.asp
- https://www.hackerrank.com/dashboard
- https://www.dataquest.io/wp-content/uploads/2019/03/python-regular-expressions-cheat-sheet.pdf
- https://learnbyexample.github.io/python-regex-cheatsheet/