# Regex introduction

This notebook has been adapted from the fast.ai course on Natural Language Processing: https://github.com/fastai/course-nlp.

Regex (regular expressions) is a pattern-matching language. It is readily available in Python and many other programming and scripting languages. Regex can be useful for:

* Search expressions
* Find and replace expressions
* Cleaning of datasets

Let's consider a motivating example.

## The phone number problem

Suppose we are given some data that includes phone numbers:

123-456-7890

123 456 7890

(123)456-7890

101 Howard

Some of the phone numbers have different formats (hyphens, no hyphens, parentheses). Also, there are some errors in the data: 101 Howard isn't a phone number! How can we find all the phone numbers?

We will first attempt this without regex, but we will see that this quickly leads to a lot of if/else branching and isn't a very promising approach:

### Attempt 1 (without regex)

In [1]:
import string

In [2]:
phone1 = "123-456-7890"

phone2 = "123 456 7890"

not_phone1 = "101 Howard"

In [3]:
string.digits

'0123456789'

In [4]:
def check_phone(inp):
    valid_chars = string.digits + ' -()'
    for char in inp:
        if char not in valid_chars:
            return False
    return True

In [5]:
assert(check_phone(phone1))
assert(check_phone(phone2))
assert(not check_phone(not_phone1))

### Attempt 2  (without regex)

In [6]:
not_phone2 = "1234"

In [7]:
assert(not check_phone(not_phone2))

AssertionError: 

We have to add an additional check to verify the length is valid.

In [8]:
def check_phone(inp):
    nums = string.digits
    valid_chars = nums + ' -()'
    num_counter = 0
    for char in inp:
        if char not in valid_chars:
            return False
        if char in nums:
            num_counter += 1
    return num_counter == 10

In [9]:
assert(check_phone(phone1))
assert(check_phone(phone2))
assert(not check_phone(not_phone1))
assert(not check_phone(not_phone2))

### Attempt 3  (without regex)

However, counting digits is not enough: the checker still accepts strings whose ten digits are grouped incorrectly.

In [10]:
not_phone3 = "34 50 98 21 32"

assert(not check_phone(not_phone3))

AssertionError: 

In [11]:
not_phone4 = "(34)(50)()()982132"

assert(not check_phone(not_phone4))

AssertionError: 

This is getting increasingly unwieldy.  We need a different approach.

## Introducing regex

Useful regex resources:

- https://regexr.com/
- http://callumacrae.github.io/regex-tuesday/
- https://regexone.com/

**Best practice: Be as specific as possible.**

Parts of the following section were adapted from Brian Spiering, who taught the MSDS [NLP elective](https://github.com/brianspiering/nlp-course).

### What is regex?

Regular expressions are a pattern-matching language.

Instead of writing `0 1 2 3 4 5 6 7 8 9`, you can write `[0-9]` or `\d`
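As a minimal sketch of this equivalence, the snippet below runs both forms over a sample string (the string itself is made up for illustration). One caveat worth knowing: for Python 3 `str` patterns, `\d` also matches non-ASCII Unicode digits unless the `re.ASCII` flag is set, so the two forms agree only on ASCII text like this one.

```python
import re

# Illustrative sample text (not from the original data).
sample = "Call 123-456-7890 before 5pm"

# A character class listing the range 0-9 ...
digits_class = re.findall(r"[0-9]", sample)
# ... and the shorthand metacharacter for "any digit".
digits_shorthand = re.findall(r"\d", sample)

print(digits_class)
print(digits_class == digits_shorthand)  # True for this ASCII-only sample
```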

It is a domain-specific language (DSL): powerful, but limited in scope.

**Other examples of DSLs:**
- SQL  
- Markdown
- TensorFlow

### Matching Phone Numbers (The "Hello, world!" of Regex)

`[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]` matches a US telephone number in the format 000-000-0000.

Refactored: `\d\d\d-\d\d\d-\d\d\d\d`

A **metacharacter** is one or more special characters that have a unique meaning and are NOT used as literals in the search expression. For example, `\d` means any digit.

**Metacharacters are the special sauce of regex.**
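To see the refactored pattern in action, here is a small sketch using Python's `re` module. The sample text is invented for illustration; `re.search` scans the string and returns the first match, or `None` if there is none.

```python
import re

# The refactored pattern: three digits, hyphen, three digits, hyphen, four digits.
pattern = re.compile(r"\d\d\d-\d\d\d-\d\d\d\d")

# Illustrative text containing one phone number and one non-phone entry.
text = "Office: 123-456-7890, address: 101 Howard"

match = pattern.search(text)
print(match.group())                 # 123-456-7890
print(pattern.search("101 Howard"))  # None
```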



Quantifiers
-----

Allow you to specify how many times the preceding expression should match. 

`{n}` is an exact quantifier: the preceding expression must match exactly `n` times.

Refactored: `\d{3}-\d{3}-\d{4}`
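A quick way to convince yourself that the quantified form is equivalent to the spelled-out form is to run both against a few candidates. This is only a sketch; the second candidate is a made-up string with the digits grouped incorrectly.

```python
import re

long_form = re.compile(r"\d\d\d-\d\d\d-\d\d\d\d")
short_form = re.compile(r"\d{3}-\d{3}-\d{4}")

candidates = ["123-456-7890", "123-45-67890", "101 Howard"]

# fullmatch succeeds only if the entire string fits the pattern.
long_results = [bool(long_form.fullmatch(c)) for c in candidates]
short_results = [bool(short_form.fullmatch(c)) for c in candidates]

print(long_results)
print(long_results == short_results)
```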

Inexact quantifiers
-----

1. `?` question mark - zero or one 
2. `*` star - zero or more
3. `+` plus sign - one or more
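The difference between the three quantifiers is easiest to see side by side. The sketch below applies each one to the hyphen in a simplified phone pattern (three digits, some hyphens, seven digits); the pattern and sample strings are chosen purely for illustration.

```python
import re

# Samples with zero, one, and two hyphens after the first three digits.
samples = ["1234567890", "123-4567890", "123--4567890"]

patterns = {
    "?": r"\d{3}-?\d{7}",  # zero or one hyphen
    "*": r"\d{3}-*\d{7}",  # zero or more hyphens
    "+": r"\d{3}-+\d{7}",  # one or more hyphens
}

results = {
    symbol: [bool(re.fullmatch(pattern, s)) for s in samples]
    for symbol, pattern in patterns.items()
}

for symbol, matched in results.items():
    print(symbol, matched)
```

With these samples, `?` accepts only the first two strings, `*` accepts all three, and `+` rejects the string without a hyphen.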

### Regex can look really weird, since it's so concise

The best (only?) way to learn it is through practice.  Otherwise, you feel like you're just reading lists of rules.

The lessons on [regexone](https://regexone.com/) are a very good starting point to learn regex through practice.

**Reminder: Be as specific as possible!**

### Pros & Cons of Regex

**What are the advantages of regex?**

1. Concise and powerful pattern matching DSL
2. Supported by many computer languages, including SQL

**What are the disadvantages of regex?**

1. Brittle 
2. Hard to write; patterns can become complex before they are correct
3. Hard to read

In [12]:
import re

In [13]:
# this regex matches candidate phone numbers with format 000-000-0000 or 000 000 0000 

def check_phone(inp):
    # raw string, so that \s reaches the regex engine unchanged
    rule = r'^[0-9]{3}[-\s][0-9]{3}[-\s][0-9]{4}$'
    regexp = re.compile(rule)
    if regexp.match(inp):
        print('valid phone number')
        return True
    else:
        print('not valid phone number')
        return False

In [14]:
assert(check_phone(phone1))
assert(check_phone(phone2))
assert(not check_phone(not_phone1))
assert(not check_phone(not_phone2))

valid phone number
valid phone number
not valid phone number
not valid phone number
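The rule above does not yet cover the parenthesized format `(123)456-7890` from the original data. One possible extension is sketched below. The helper name `check_phone_with_parens` is hypothetical, and the sketch assumes the only parenthesized form we want is an area code in parentheses immediately followed by the rest of the number. It uses two constructs not yet introduced: `\(` and `\)` for literal parentheses, and `|` inside a group to mean "either this or that".

```python
import re

# Hypothetical rule: accepts 000-000-0000, 000 000 0000, and (000)000-0000.
# First group: either "(ddd)" or "ddd" followed by a hyphen/whitespace.
PHONE_RULE = re.compile(r"(\(\d{3}\)|\d{3}[-\s])\d{3}[-\s]\d{4}")

def check_phone_with_parens(inp):
    # fullmatch requires the whole string to fit the pattern,
    # so no ^ or $ anchors are needed.
    return PHONE_RULE.fullmatch(inp) is not None

for candidate in ["123-456-7890", "(123)456-7890", "(34)(50)()()982132"]:
    print(candidate, check_phone_with_parens(candidate))
```

Unlike the digit-counting attempts, this version also rejects inputs such as `34 50 98 21 32`, because the grouping of the digits is part of the pattern.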


Regex Terms
----


- __target string__:	This term describes the string that we will be searching, that is, the string in which we want to find our match or search pattern.


- __search expression__: The pattern we use to find what we want. Most commonly called the regular expression. 


- __literal__: A literal is any character we use in a search or matching expression. For example, to find 'ind' in 'windows', the 'ind' is a literal string: each character plays a part in the search, and it is literally the string we want to find.

- __metacharacter__: A metacharacter is one or more special characters that have a unique meaning and are NOT used as literals in the search expression. For example "." means any character.

- __escape sequence__: An escape sequence is a way of indicating that we want to use a metacharacter as a literal.

In a regular expression an escape sequence involves placing the metacharacter \ (backslash) in front of the metacharacter that we want to use as a literal. 

`'\.'` means find a literal period character (rather than matching any character).
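The terms above can be illustrated with a short sketch. The 'windows' example comes from the definition of a literal; the version strings are made up to show the difference between `.` and `\.`.

```python
import re

# Literal: every character must appear exactly as written.
literal = re.search(r"ind", "windows")
print(literal.span())  # (1, 4)

text = "v1.0 and v1x0"

# Metacharacter: an unescaped . matches any character (except a newline).
unescaped = re.findall(r"v1.0", text)
print(unescaped)  # ['v1.0', 'v1x0']

# Escape sequence: \. matches only a literal period.
escaped = re.findall(r"v1\.0", text)
print(escaped)  # ['v1.0']
```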

Regex Workflow
---
1. Create pattern in Plain English
2. Map to regex language
3. Make sure results are correct:
    - All Positives: The regex captures every example of the pattern
    - No Negatives: Everything the regex captures is a genuine example of the pattern
4. Don't over-engineer your regex. 
    - Your goal is to Get Stuff Done, not write the best regex in the world
    - Filtering before and after are okay.
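The workflow above can be sketched as a small check harness. The positive and negative examples are taken from earlier in this notebook; the pattern is one possible mapping of the plain-English description, not the only one.

```python
import re

# Step 1 (plain English): three digits, a hyphen or space, three digits,
# a hyphen or space, four digits.
# Step 2 (regex):
PHONE = re.compile(r"\d{3}[- ]\d{3}[- ]\d{4}")

# Step 3: check the results against known examples.
positives = ["123-456-7890", "123 456 7890"]
negatives = ["101 Howard", "1234", "34 50 98 21 32"]

# All Positives: nothing we want should be missed.
missed = [s for s in positives if not PHONE.fullmatch(s)]
# No Negatives: nothing we don't want should be captured.
false_hits = [s for s in negatives if PHONE.fullmatch(s)]

print("missed:", missed)
print("false hits:", false_hits)
```

Keeping the positive and negative lists next to the pattern makes it easy to add a new example whenever the data surprises you, which helps with the brittleness mentioned earlier.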