# Resources

* https://regex101.com/

In [1]:
from typing import (
    List,
    Dict,
    Tuple,
    Optional
)
import os
import sys
import re
import string

print(sys.version)

3.8.10 | packaged by conda-forge | (default, May 11 2021, 07:01:05) 
[GCC 9.3.0]


# References

* [Regular Expression HOWTO](https://docs.python.org/3/howto/regex.html)

---
# Pattern

## Meta characters

```
. ^ $ * + ? { } [ ] \ | ( )
```

### Character class

```[...]``` is a **character class** to specify a set of characters to match.

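**Most metacharacters are NOT active inside a class ```[...]```**. The exceptions are ```\``` (still an escape), ```^``` as the first character (negates the class), ```-``` between two characters (forms a range), and ```]``` (closes the class).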
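
As a quick check of how metacharacters behave inside a class, here is a minimal sketch. The sample strings are made up for illustration. Most metacharacters are matched literally, while a leading `^`, a `-` between two characters, and the backslash keep a special role.

```python
import re

# Inside a class, . * + ? $ are plain literals.
literal_class = re.compile(r"[.*+?$]")
found = literal_class.findall("a.b*c+d?e$")

# A leading ^ negates the class; a - between two characters forms a range.
not_digits = re.findall(r"[^0-9]", "a1b2")
hex_digits = re.findall(r"[0-9a-f]", "x1fz")

# A backslash still escapes inside a class, so \] and \\ match ] and \.
brackets = re.findall(r"[\]\\]", r"a]b\c")

print(found, not_digits, hex_digits, brackets)
```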

### Sequence

```
\d
Matches any decimal digit; this is equivalent to the class [0-9].

\D
Matches any non-digit character; this is equivalent to the class [^0-9].

\s
Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].

\S
Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].

\w
Matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].

\W
Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].
```
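
The sequences above can be tried out with `re.findall`. This is a small sketch on a made-up sample string, not an exhaustive test.

```python
import re

sample = "Room 42,\tfloor_3\n"

digits = re.findall(r"\d+", sample)      # runs of decimal digits
words = re.findall(r"\w+", sample)       # runs of letters, digits and underscore
spaces = re.findall(r"\s", sample)       # individual whitespace characters
non_words = re.findall(r"\W+", sample)   # runs of everything that is not \w

print(digits, words, spaces, non_words)
```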

## Word boundary ```\b```

* [Word Boundaries](https://www.regular-expressions.info/wordboundaries.html)

> The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a “word boundary”. This match is zero-length.
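
To illustrate the zero-length match at word edges, the sketch below compares a pattern with and without `\b` on a made-up sentence.

```python
import re

text = "cat concat cat's category"

# \b anchors the match at word edges, so only the standalone word matches
# ("cat" and the "cat" in "cat's", because ' is not a word character).
whole_words = re.findall(r"\bcat\b", text)

# Without \b the pattern also matches inside longer words.
substrings = re.findall(r"cat", text)

print(len(whole_words), len(substrings))
```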

---
## Raw string

Pass any regexp pattern that contains backslash **sequences** as a **raw string**.

```
re.compile(r"[\s]+") # <--- use raw string r"..."
```

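The sequence **'\s'** is not one of Python's defined [escape sequences](https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals). In a normal string literal Python keeps the backslash but emits a ```DeprecationWarning``` for the invalid escape (a ```SyntaxWarning``` from Python 3.12 onward). Sequences that are defined, listed below, are converted before the regex engine sees them, so ```"\b"``` becomes a backspace character instead of a word boundary.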

```
\a
ASCII Bell (BEL)
\b
ASCII Backspace (BS)
\f
ASCII Formfeed (FF)
\n
ASCII Linefeed (LF)
\r
ASCII Carriage Return (CR)
\t
ASCII Horizontal Tab (TAB)
\v
ASCII Vertical Tab (VT)
```
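
The practical effect of these escape sequences can be seen with `\b`. The sketch below uses a made-up sentence and shows that the same pattern text behaves differently as a normal and as a raw string.

```python
import re

# In a normal string literal "\b" is the single backspace character.
normal_length = len("\b")
# In a raw string the backslash is kept, so the regex engine sees \b.
raw_length = len(r"\b")

text = "a cat sat"
with_raw = re.findall(r"\bcat\b", text)     # word-boundary match
without_raw = re.findall("\bcat\b", text)   # looks for BS + "cat" + BS

print(normal_length, raw_length, with_raw, without_raw)
```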

In [4]:
import re
re.compile(r"[\s]+")    # <---- Tell Python to pass the string AS IS.

re.compile(r'[\s]+', re.UNICODE)

In [3]:
r"\s\W"   # Four characters; the repr below doubles each backslash

'\\s\\W'

## Escape

Use [re.escape](https://docs.python.org/3/library/re.html#re.escape)

> Escape special characters in pattern. This is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it. 
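
A minimal sketch of why escaping matters when searching for a literal string. The `price` and `text` values are hypothetical examples, not taken from the notebook.

```python
import re

price = "$9.99 (approx.)"   # hypothetical literal we want to find verbatim
text = "It costs $9.99 (approx.), not $9x99."

# Unescaped, "$", "." and the parentheses are metacharacters,
# so the literal text is not found.
unescaped = re.search(price, text)

# re.escape turns every metacharacter into a literal.
escaped = re.search(re.escape(price), text)

print(unescaped, escaped.group(0))
```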

In [4]:
re.escape(r"!#$%&'*+-.^_`|~:")

"!\\#\\$%\\&'\\*\\+\\-\\.\\^_`\\|\\~:"

## Patterns

### Match punctuation and whitespace characters

In [5]:
pattern = '[%s%s]+' % (re.escape(string.punctuation), r"\s")
pattern

'[!"\\#\\$%\\&\'\\(\\)\\*\\+,\\-\\./:;<=>\\?@\\[\\\\\\]\\^_`\\{\\|\\}\\~\\s]+'

In [6]:
text = "[(I am a @^-^& cat! who has no~~~     name and no #id%%;(.)]"
re.compile(pattern).sub(repl=" ", string=text).strip().lower()

'i am a cat who has no name and no id'

In [7]:
corpus = """This website contains the full text of the Python Data Science Handbook by Jake VanderPlas; 
the content is available on GitHub in the form of Jupyter notebooks."""

In [8]:
#%%timeit
pattern = '[%s%s]+' % (re.escape(string.punctuation), r"\s")
words = re.compile(pattern).sub(repl=" ", string=corpus).lower().strip().split()
id_to_word = dict(enumerate(set(words)))
word_to_id = dict(zip(id_to_word.values(), id_to_word.keys()))

In [9]:
for index in [word_to_id[w] for w in words]:
    print(id_to_word[index])

this
website
contains
the
full
text
of
the
python
data
science
handbook
by
jake
vanderplas
the
content
is
available
on
github
in
the
form
of
jupyter
notebooks


In [18]:
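# Pass the literal characters to re.escape once; pre-escaping them is redundant.
# "-" is not in the list, so the hyphen survives in the output.
pattern = '[%s%s]+' % (re.escape('`~!@#$%^&*()_.=+[]{}\\|;:"\'<>,/?'), r"\s")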
re.sub(pattern, " ", text)

' I am a - cat who has no name and no id '

In [30]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [35]:
sentences = "[ ( . I am ...... @@@@@ #@&*^$&*R a @^-^& cat! who has no~~~     name and no #id%%;(.)]"
removals = re.escape(string.punctuation)
pattern = '[%s]+' % (removals)
sentences = re.sub(pattern, " ", sentences)  
sentences

'      I am      R a   cat  who has no      name and no  id '

---

## 4 digits

In [17]:
match = re.match("^[0-9]{4}$", "0034")
match.group(0)

'0034'
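
One caveat with `^...$` is that `$` also matches just before a trailing newline. If the whole string must be exactly four digits, `re.fullmatch` is a possible alternative. The helper name `is_four_digits` below is an assumption for illustration.

```python
import re

FOUR_DIGITS = re.compile(r"[0-9]{4}")

def is_four_digits(text: str) -> bool:
    """Return True only if the whole string is exactly four ASCII digits."""
    return FOUR_DIGITS.fullmatch(text) is not None

# "$" also matches just before a trailing newline, so this still succeeds.
match_with_newline = re.match(r"^[0-9]{4}$", "0034\n")

print(is_four_digits("0034"), is_four_digits("0034\n"), match_with_newline is not None)
```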

## Year

In [15]:
match = re.match("^[1-2][0-9]{3}$", "2009")
match.group(0)

'2009'
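
Note that `[1-2][0-9]{3}` accepts every value from 1000 to 2999. If a narrower range is wanted, the alternatives can be spelled out. The sketch below assumes a hypothetical 1900 to 2099 range and a hypothetical helper `is_year`.

```python
import re

YEAR_1000_2999 = re.compile(r"[1-2][0-9]{3}")      # same range as the pattern above
YEAR_1900_2099 = re.compile(r"(?:19|20)[0-9]{2}")  # assumed narrower range

def is_year(text: str, pattern: "re.Pattern[str]" = YEAR_1900_2099) -> bool:
    """Return True if the whole string matches the given year pattern."""
    return pattern.fullmatch(text) is not None

print(is_year("2009"), is_year("1066"), is_year("1066", YEAR_1000_2999))
```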

## URL

In [14]:
url = "https://sec.gov/Archives/edgar/data/310158/000031015821000032/index.xml"
pattern = r"(http|https)://(www\.sec\.gov|sec\.gov)(.*/)index\.xml"
match = re.search(pattern, url, re.IGNORECASE)
match.group(0)

'https://sec.gov/Archives/edgar/data/310158/000031015821000032/index.xml'
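
If the individual parts of the URL are needed, named groups are one option. This is a sketch only; the group names `scheme`, `host` and `path` are assumptions chosen for readability.

```python
import re

url = "https://sec.gov/Archives/edgar/data/310158/000031015821000032/index.xml"

# (?P<name>...) labels each group; the dots in the host and file name are escaped.
url_pattern = re.compile(
    r"(?P<scheme>https?)://(?P<host>(?:www\.)?sec\.gov)(?P<path>.*/)index\.xml",
    re.IGNORECASE,
)

match = url_pattern.fullmatch(url)
parts = match.groupdict() if match else {}
print(parts)
```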

---
# Examples

In [14]:
# Find "PLACE 1 2 NORTH"
def parse_place_command(line: str) -> Optional[Tuple[int, int, str]]:
    # Anchor first, then allow leading whitespace; \s already covers \t.
    # The trailing \s*$ rejects extra text such as "NORTHWEST".
    pattern = r'^\s*PLACE\s+([0-9]+)\s+([0-9]+)\s+(NORTH|EAST|WEST|SOUTH)\s*$'
    if match := re.search(pattern, line, re.IGNORECASE):
        x = int(match.group(1))
        y = int(match.group(2))
        direction = match.group(3).upper()
        return x, y, direction
    else:
        return None

def parse_command(line: str, command: str) -> Optional[str]:
    # re.escape keeps the command literal; \s*$ rejects input such as "MOVEMENT".
    pattern = r'^\s*({})\s*$'.format(re.escape(command))
    # group(1) is the command itself, without the surrounding whitespace.
    return match.group(1).upper() if (match := re.search(pattern, line, re.IGNORECASE)) else None

def parse_move_command(line: str) -> Optional[str]:
    return parse_command(line, "MOVE")

def parse_left_command(line: str) -> Optional[str]:
    return parse_command(line, "LEFT")

def parse_right_command(line: str) -> Optional[str]:
    return parse_command(line, "RIGHT")

def parse_report_command(line: str) -> Optional[str]:
    # The original returned the undefined name REPORT, which raises NameError.
    return parse_command(line, "REPORT")

In [15]:
line = "place   0 1   SouTh"
if args := parse_place_command(line):
    x = args[0]
    y = args[1]
    direction = args[2]
    
    print("{} {} {}".format(x, y, direction))
    
print(args)

0 1 SOUTH
(0, 1, 'SOUTH')


In [16]:
line = "MOVE"
parse_move_command(line)

'MOVE'
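
The four argument-less commands could also be recognised with a single alternation. This is a sketch under assumptions: the names `COMMANDS`, `COMMAND_PATTERN` and `parse_simple_command` are hypothetical and not part of the notebook above.

```python
import re
from typing import Optional

COMMANDS = ("MOVE", "LEFT", "RIGHT", "REPORT")

# One anchored pattern: optional whitespace, one of the commands, optional whitespace.
COMMAND_PATTERN = re.compile(
    r"^\s*({})\s*$".format("|".join(re.escape(c) for c in COMMANDS)),
    re.IGNORECASE,
)

def parse_simple_command(line: str) -> Optional[str]:
    """Return the upper-cased command name, or None if the line is not a command."""
    match = COMMAND_PATTERN.search(line)
    return match.group(1).upper() if match else None

print(parse_simple_command("  left "), parse_simple_command("MOVEMENT"))
```

Anchoring both ends means a line such as `MOVEMENT` is rejected rather than read as `MOVE`.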