# Regex Coding Practice

Regular expressions (regex) are useful in many NLP tasks, including data pre-processing, pattern matching, text feature engineering, web scraping, and data extraction. They are supported in many programming languages and tools (e.g., Java, JavaScript, Python, Unix sed, R).

For a comparison between Python, Perl, and Vim, see https://remram44.github.io/regex-cheatsheet/regex.html

## Regex Syntax
### Class

**[  ]** - specifies a character class: a set of characters, any one of which can match at that position

    [abc] - will match a single character: a, b, or c

    [123] - will match a single character: 1, 2, or 3
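As a quick check of the character classes above, here is a minimal sketch using Python's `re.findall`. The sample string is made up for illustration.

```python
import re

# Hypothetical sample text, chosen only for illustration
text = 'a cab in 2031'

# [abc] matches exactly one character that is a, b, or c
letters = re.findall(r'[abc]', text)
print(letters)  # ['a', 'c', 'a', 'b']

# [123] matches exactly one character that is 1, 2, or 3
digits = re.findall(r'[123]', text)
print(digits)   # ['2', '3', '1']
```

Each match is a single character, because a class on its own consumes one position in the string.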
    


### Range

**dash** (`-`) - indicates a range of characters inside a character class

    [a-z]  - will match any one lowercase letter from a to z

    [0-9] -  will match any one digit from 0 to 9
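The ranges above can be tried the same way. This is a small sketch with an invented sample string. Note that the uppercase letters are skipped because `[a-z]` covers lowercase letters only.

```python
import re

# Hypothetical sample text, chosen only for illustration
text = 'Room 42b, Floor 7'

# [a-z] matches one lowercase letter; 'R' and 'F' are not in the range
lower = re.findall(r'[a-z]', text)
print(lower)  # ['o', 'o', 'm', 'b', 'l', 'o', 'o', 'r']

# [0-9] matches one digit
nums = re.findall(r'[0-9]', text)
print(nums)   # ['4', '2', '7']
```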

### Operators

\* - will match zero or more repetitions of the preceding item

    [a-z]* - will match zero or more characters from a to z (including the empty string)

\+ - will match one or more repetitions of the preceding item

    [a-z]+ - will match one or more characters from a to z: a, aa, abd, etc.

? - will match zero or one occurrence of the preceding item

    A? - Letter A will be optional

{n} - will match exactly n repetitions of the preceding item
   
    A{5} - Letter A is repeated exactly 5 times: AAAAA

{n,m} - will match at least n but no more than m repetitions

    A{2,3} - Letter A is repeated at least 2 times but no more than 3 times: AA or AAA
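To see how each operator behaves, the sketch below tests whole strings with `re.fullmatch`, which succeeds only when the entire string fits the pattern. The helper name `fits` is hypothetical and exists only for this illustration.

```python
import re

def fits(pattern, s):
    """Return True if the whole string s matches pattern (illustrative helper)."""
    return re.fullmatch(pattern, s) is not None

print(fits(r'[a-z]*', ''))      # True: zero characters is allowed
print(fits(r'[a-z]+', ''))      # False: at least one character is required
print(fits(r'[a-z]+', 'abd'))   # True
print(fits(r'A?B', 'B'))        # True: the A is optional
print(fits(r'A?B', 'AB'))       # True
print(fits(r'A{5}', 'AAAAA'))   # True
print(fits(r'A{5}', 'AAAA'))    # False: only 4 repetitions
print(fits(r'A{2,3}', 'AA'))    # True
print(fits(r'A{2,3}', 'AAAA'))  # False: more than 3 repetitions
```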

### Special Metacharacters

``. ^ $ * + ? { } [ ] \ | ( )``

``.`` will match any one character (numeric or non-numeric) except a newline [Dot]

``^`` will match at the beginning of the string [Caret]

``$`` will match at the end of the string [Dollar Sign]

``|`` will match either alternative (logical OR): ``a|b`` matches a or b [Pipe]

``\`` escapes a metacharacter so that it is matched literally [Backslash]

    a metacharacter cannot be matched as a literal character unless it is preceded by an escape [backslash] ``\``

    Ex. To find a real $, you will need ``\$``
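The metacharacters above can be exercised on a short string. This is a sketch under the assumption of a made-up sample text. The variable names are illustrative only.

```python
import re

# Hypothetical sample text, chosen only for illustration
text = 'cost: $5 or $7'

# \$ is a literal dollar sign; . then matches any one character after it
priced = re.findall(r'\$.', text)
print(priced)  # ['$5', '$7']

# ^ anchors the pattern at the beginning of the string
starts_with_cost = re.search(r'^cost', text) is not None
starts_with_or = re.search(r'^or', text) is not None
print(starts_with_cost, starts_with_or)  # True False

# $ (unescaped) anchors the pattern at the end of the string
ends_with_7 = re.search(r'7$', text) is not None
print(ends_with_7)  # True

# | matches either alternative
either = re.findall(r'cost|or', text)
print(either)  # ['cost', 'or']
```

The same character `$` plays two roles here: escaped it is a literal dollar sign, unescaped it is the end-of-string anchor.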

### Regex Shortcuts

``\d``  will match any decimal digit 
        
        this is equivalent to the class [0-9]

``\D``  will match  any non-digit character 
    
        this is equivalent to the class [^0-9]

``\s``  will match any whitespace character

        this is equivalent to the class [ \t\n\r\f\v]: space, tab, newline, carriage return, form feed, vertical tab

``\S`` will match any non-whitespace character

        this is equivalent to the class [^ \t\n\r\f\v]

``\w`` will match any alphanumeric character or underscore

        this is equivalent to the class [a-zA-Z0-9_]

``\W`` will match any character that is not alphanumeric or an underscore

        this is equivalent to the class [^a-zA-Z0-9_]
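The shortcuts can be compared side by side on one string. This is a minimal sketch with an invented sample. One caveat worth knowing: for Python 3 `str` patterns the shortcuts are Unicode-aware by default, so the ASCII classes listed above are exact equivalents only when the `re.ASCII` flag is passed.

```python
import re

# Hypothetical sample text, chosen only for illustration
text = 'Tel: 555 0199!'

digits = re.findall(r'\d+', text)      # ['555', '0199']
non_digits = re.findall(r'\D+', text)  # ['Tel: ', ' ', '!']
spaces = re.findall(r'\s', text)       # [' ', ' ']
chunks = re.findall(r'\S+', text)      # ['Tel:', '555', '0199!']
words = re.findall(r'\w+', text)       # ['Tel', '555', '0199']
symbols = re.findall(r'\W', text)      # [':', ' ', ' ', '!']
print(digits, words, symbols)

# Default (Unicode) vs. re.ASCII behaviour of \w on an accented letter
unicode_words = re.findall(r'\w+', 'caf\u00e9')                # ['café']
ascii_words = re.findall(r'\w+', 'caf\u00e9', flags=re.ASCII)  # ['caf']
print(unicode_words, ascii_words)
```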

### Return

``( )`` - defines a capture group: the whole pattern must match, but the text matched inside the parentheses can be returned on its own

   ``[a-z]*([0-9][0-9])[a-z]`` will match abs12z but return only 12 (see translation for details)

**Translation**

- [a-z]* - match zero or more characters from a to z
- [0-9] - match one digit from the range 0-9
- [0-9] - match one digit from the range 0-9
- [a-z] - match one letter from a to z

After the pattern is found

- ([0-9][0-9]) - select and return only the two digits inside the parentheses
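The translation above can be verified in Python. In this sketch, `group(0)` is the full match and `group(1)` is the text captured by the parentheses. The second sample string passed to `re.findall` is made up for illustration.

```python
import re

pattern = r'[a-z]*([0-9][0-9])[a-z]'

m = re.search(pattern, 'abs12z')
whole = m.group(0)     # the entire matched text
captured = m.group(1)  # only the part inside ( )
print(whole)     # abs12z
print(captured)  # 12

# With exactly one group, re.findall returns the captured text of each match
all_captured = re.findall(pattern, 'abs12z xy34q')
print(all_captured)  # ['12', '34']
```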

## Python Built-In RegEx Module

Python has a built-in module for regular expressions called **re**. Here are some commonly used functions:

- `re.sub()`
- `re.match()`
- `re.search()`
- `re.findall()`

In [5]:
import re

### re.sub

``re.sub(pattern, repl, string, count=0, flags=0)``

- `pattern` is the regular expression that you want to match
- `repl` is the replacement
- `string` is the input string
- `count` specifies the maximum number of matches to replace; the default, 0, replaces all matches
- `flags` optionally modifies how the pattern is matched (e.g., `re.IGNORECASE`)
- If `re.sub()` cannot find a match, it returns the original string unchanged
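The effect of `count` and the no-match case described above can be sketched as follows. The date string is an invented example.

```python
import re

# Hypothetical sample text, chosen only for illustration
text = '2024-01-15'

# Default count=0: every match is replaced
all_replaced = re.sub(r'-', '/', text)
print(all_replaced)  # 2024/01/15

# count=1: only the first match is replaced
first_only = re.sub(r'-', '/', text, count=1)
print(first_only)    # 2024/01-15

# No match: the original string comes back unchanged
no_match = re.sub(r'x', '/', text)
print(no_match)      # 2024-01-15
```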

**Ex.1** Convert the phone number (212)-456-7890 to digits only by removing every other character. Note: review the regex shortcuts above.

In [6]:
phone_no = '(212)-456-7890'
pattern = r'\D'  # raw string; \D matches any non-digit character
result = re.sub(pattern, '', phone_no)  # replace each non-digit with nothing
print(result)

2124567890


**Ex.2** Replace text surrounded by asterisks (`*`, Markdown format) with the `<strong>` tag in HTML.

- ``r`` means the string will be treated as a raw string, so backslashes reach the regex engine unchanged
- ``<strong></strong>`` are two HTML tags. The closing tag ``</strong>`` uses a forward slash, which is not a metacharacter and needs no escaping
- ``\1`` in the replacement inserts the text captured by the first group ``( )``

In [8]:
string = 'Restaurant Review *Better Place*'
pattern = r'\*(.*?)\*'  # literal *, lazily captured text, literal *
replacement = r'<strong>\1</strong>'
html = re.sub(pattern, replacement, string)
print(html)

Restaurant Review <strong>Better Place</strong>


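**Translation**

- ``\*`` - match a literal asterisk (the backslash escapes the metacharacter)
- ``(.*?)`` - capture any characters, as few as possible, up to the next asterisk
- ``\*`` - match the closing literal asterisk
- ``\1`` - in the replacement, insert the text captured by the group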
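To see why the `?` in `(.*?)` matters, the sketch below runs the same substitution with a lazy and a greedy group on a string that contains two starred phrases. The sample string is made up, and the closing tag is written with a forward slash, as HTML requires.

```python
import re

# Hypothetical sample text with two starred phrases
text = 'A *first* and *second* phrase'

# Lazy: .*? stops at the nearest closing asterisk, so each phrase is wrapped
lazy = re.sub(r'\*(.*?)\*', r'<strong>\1</strong>', text)
print(lazy)    # A <strong>first</strong> and <strong>second</strong> phrase

# Greedy: .* runs to the last asterisk, so both phrases collapse into one match
greedy = re.sub(r'\*(.*)\*', r'<strong>\1</strong>', text)
print(greedy)  # A <strong>first* and *second</strong> phrase
```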


### re.match

`re.match(pattern, string)` checks for a match only at the beginning of the string. It returns a match object on success and `None` on failure.

In [None]:
import re

# match a word at the beginning of a string
result = re.match(r'NLP', 'NLP is ')
print(result)  # a match object covering 'NLP' (span 0 to 3)
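For contrast with `re.match`, here is a short sketch of the other two functions listed earlier: `re.search` scans the whole string for the first match, and `re.findall` returns every match as a list. The sample sentence is invented for illustration.

```python
import re

# Hypothetical sample text, chosen only for illustration
text = 'NLP is fun; I study NLP'

# re.match only looks at the beginning of the string
at_start = re.match(r'NLP', text)
not_at_start = re.match(r'fun', text)
print(at_start is not None)  # True
print(not_at_start)          # None

# re.search finds the first match anywhere in the string
found = re.search(r'fun', text)
print(found.group(), found.span())  # fun (7, 10)

# re.findall returns all non-overlapping matches
every = re.findall(r'NLP', text)
print(every)  # ['NLP', 'NLP']
```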

Sources:

- https://www.pythontutorial.net/python-regex/python-regex-sub/
- https://www.analyticsvidhya.com/blog/2021/03/beginners-guide-to-regular-expressions-in-natural-language-processing/
- https://www.kaggle.com/code/albeffe/regex-exercises-solutions
- https://regexone.com/references/python