<p><a name="sections"></a></p>
<br>
<br>
# Sections
- <a href="#re">Regular Expressions</a><br>
    - <a href="#meta">Metacharacters</a><br>
        - <a href="#dot">Dot</a><br>
        - <a href="#question">Question mark, plus, asterisk, and {}</a><br>
        - <a href="#caret">Caret/dollar sign</a><br>
        - <a href="#bracket">Bracket</a><br>
        - <a href="#vertical">Vertical Bar</a><br>
        - <a href="#backslash">Backslash</a><br>
    - <a href="#function">Functions in Regular Expression</a><br>
        - <a href="#sub">re.sub</a><br>
        - <a href="#split">re.split</a><br>
        - <a href="#findall">re.findall</a><br>
    - <a href="#example">Example: Wordcount</a><br>
    - <a href="#email">Example: Is it an email address</a><br>

<p><a name="re"></a></p>

## Regular Expressions

We have seen some basic and intermediate functions for handling and working with strings.

However, if you really want to unleash the power of string manipulation, it's necessary to learn regular expressions.

- **Concept**

A regular expression is a special text string that describes a set of strings. The description itself is called a **pattern**, and any string in the set is said to match the pattern.

The goal of using regular expressions is to find or extract specific pieces of text by describing their pattern.

- **Pattern**

For example, both **gray** and **grey** match the pattern **gr.y** in which the dot **.** refers to an arbitrary character.
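As a minimal sketch of this idea, the snippet below uses Python's `re` module (introduced later in this section) to check which words contain a match for `gr.y`. The word list is made up for illustration.

```python
import re

# Hypothetical sample words; only the first two fit the pattern gr.y
words = ['gray', 'grey', 'gry', 'green']
matched = [w for w in words if re.search('gr.y', w) is not None]
print(matched)  # ['gray', 'grey']
```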

<p><a name="meta"></a></p>
### Metacharacters
[[back to top]](#sections)

The simplest form of regular expressions is a pattern that matches a single character, for example, `a` matches exactly the character 'a'.

However, there are some special characters that have a reserved status and they are known as **metacharacters**.

>    . ^ $ * + ? { } [ ] \ | ( )

These metacharacters have special meaning when working with regular expressions. For example, the expression `a|b` does not match the literal text `a|b`; it matches either `a` or `b`. 

The backslash `\` is called an **escape operator**, which turns a metacharacter into a normal character. For example, the regular expression `a\|b` matches exactly the string `a|b`.
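The difference can be seen in a short sketch. It assumes the sample text `'a|b'` and uses a raw string (`r'...'`) so that Python passes the backslash to the regex engine unchanged. `re.escape` is a standard-library helper that adds the backslashes automatically.

```python
import re

text = 'a|b'

# Unescaped: '|' means "or", so the match is just the first alternative
print(re.search('a|b', text).group())    # a

# Escaped: the pattern matches the literal three-character string
print(re.search(r'a\|b', text).group())  # a|b

# re.escape builds the escaped pattern for you
print(re.escape('a|b'))                  # a\|b
```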

### Python Module: re

The **re** module in Python's standard library implements regular expressions.

In [None]:
import re
raw_string = 'Hi, how are you today?, Hi'
print(re.search('Hi', raw_string))

In [None]:
print(re.search('Hello', raw_string))

`re.search` returns a match object (`re.Match` in Python 3, `SRE_Match` in Python 2) if the pattern matches somewhere in the string; otherwise it returns `None`.

In [None]:
s = re.search('Hi', raw_string)
print(s.start()) # the starting index of the matched string
print(s.end())   # the index just past the end of the matched string
print(s.span())  # a tuple containing the (start, end) positions of the matched string
print(s.group()) # the matched string

### The meaning of metacharacters
`re.search(pattern, string) != None` is `True` if the pattern matches anywhere in the string (idiomatic Python would write `is not None`). We will use this expression to test our regular expressions.

<p><a name="dot"></a></p>
#### dot
[[back to top]](#sections)

`.` matches any single character except a newline. For example, `a.` matches any two-character sequence that starts with 'a': `aa`, `ab`, `an`, `a1`, `a#`, etc.

In [None]:
print(re.search('a.', 'aa') != None)
print(re.search('a.', 'ab') != None)
print(re.search('a.', 'a1') != None)
print(re.search('a.b', 'a+b') != None)
print(re.search('a.b', 'a+x+b') != None)
print(re.search('../../201.', 'From 06/01/2015') != None)

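What if we just want to extract the month from the matched date?

`re.search('../../201.', 'From 06/01/2015')` -> `06`

Wrapping parts of the pattern in parentheses creates groups, and `group(1)` returns the text matched by the first group.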

In [None]:
s = re.search('(..)/(..)/(201.)', 'From 06/01/2015')
print(s.group(1))
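The other groups of the same match can be read in the same way. The sketch below reuses the pattern from the cell above; `group(0)` and `groups()` are standard match-object methods.

```python
import re

s = re.search('(..)/(..)/(201.)', 'From 06/01/2015')
print(s.group(0))  # 06/01/2015  (the whole match)
print(s.group(2))  # 01          (second group)
print(s.group(3))  # 2015        (third group)
print(s.groups())  # ('06', '01', '2015')
```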

<p><a name="question"></a></p>
#### Question mark, plus, asterisk, and {}
[[back to top]](#sections)

`?` matches the preceding expression zero times or once.

`+` matches the preceding expression one or more times.

`*` matches the preceding expression zero or more times.

`{m,n}` matches the preceding expression at least m times and at most n times.

For example, `ba?b` matches `bab` and `bb`.

In [None]:
print(re.search('ba?b', 'bb') != None)    # match
print(re.search('ba?b', 'bab') != None)   # match
print(re.search('ba?b', 'baab') != None)  # does not match

`ba+b` matches `bab`, `baab`, `baaaab`, `baaaaaab`, etc., but not `bb`.

In [None]:
print(re.search('ba+b', 'bb') != None)    # does not match
print(re.search('ba+b', 'bab') != None)   # match
print(re.search('ba+b', 'baab') != None)  # match

`ba*b` matches all of the above: `bb`, `bab`, `baab`, and so on.

In [None]:
print(re.search('ba*b', 'bb') != None)    # match
print(re.search('ba*b', 'bab') != None)   # match
print(re.search('ba*b', 'baaaaaab') != None)  # match

`ba{1,3}b` matches `bab`, `baab` and `baaab`.

In [None]:
print(re.search('ba{1,3}b', 'bab') != None)    # match
print(re.search('ba{1,3}b', 'baab') != None)   # match
print(re.search('ba{1,3}b', 'baaab') != None)  # match

print(re.search('ba{1,3}b', 'bb') != None)     # does not match
print(re.search('ba{1,3}b', 'baaaab') != None) # does not match

`ba{0,1}b` is the same as `ba?b`. 

`ba{1,}b` is the same as `ba+b`. 

`ba{3,}b` matches `baaab`, `baaaab`, etc., in which 'a' appears at least 3 times.
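These equivalences can be checked with a small sketch. The helper `matching` and the candidate list are hypothetical names introduced only for this illustration. The last line also shows `{m}`, which matches exactly m repetitions.

```python
import re

candidates = ['bb', 'bab', 'baab', 'baaab', 'baaaab']

def matching(pattern):
    # Hypothetical helper: list the candidates that contain a match
    return [c for c in candidates if re.search(pattern, c) is not None]

print(matching('ba{0,1}b') == matching('ba?b'))  # True
print(matching('ba{1,}b') == matching('ba+b'))   # True
print(matching('ba{3,}b'))                       # ['baaab', 'baaaab']
print(matching('ba{2}b'))                        # ['baab']
```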

<p><a name="caret"></a></p>
#### caret / dollar sign
[[back to top]](#sections)

`^` refers to the beginning of a text, while `$` refers to the ending of a text. 

For example, `^a` matches any text that begins with character `a`.

`a$` matches any text ending with character `a`. 

In [None]:
print(re.search('^a', 'abc') != None)    # match
print(re.search('^a', 'abcde') != None)  # match
print(re.search('^a', ' abcde') != None) # does not match

In [None]:
print(re.search('a$', 'aba') != None)   # match
print(re.search('a$', 'abcba') != None)  # match
print(re.search('a$', ' aba ') != None)  # does not match
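Using both anchors in one pattern requires the whole text to fit the pattern, which the email example at the end of this section relies on. A minimal sketch with made-up inputs:

```python
import re

# Without anchors, a match anywhere inside the text is enough
print(re.search('ba+b', 'xbaab!') is not None)    # True

# With both anchors, extra characters at either end prevent a match
print(re.search('^ba+b$', 'xbaab!') is not None)  # False
print(re.search('^ba+b$', 'baab') is not None)    # True
```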

<p><a name="bracket"></a></p>
#### bracket
[[back to top]](#sections)

- `[]` specifies a set of characters, any one of which may match. For example, `[123abc]` matches any one of the characters `1, 2, 3, a, b`, or `c`; this is the same as `[1-3a-c]`, which uses ranges to express the same set of characters. Furthermore, `[a-z]` matches any lowercase letter, while `[0-9]` matches any digit.
- Special characters lose their special meaning inside sets. For example, `[(+*)]` will match any of the literal characters '(', '+', `'*'`, or ')'.
- A set can be complemented: if the first character of the set is '^', any character that is not in the set will be matched. For example, `[^15abij]` matches any character other than `1, 5, a, b, i`, or `j`.

In [None]:
print(re.search('[123abc]', 'defg')  != None)   # does not match
print(re.search('[123abc]', '1defg') != None)   # match
print(re.search('[1-3a-c]', '2defg') != None)   # match
print(re.search('[15abij]', '2degh') != None)   # does not match
print(re.search('[^15abij]', '2degh') != None)   # match
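A few more cases, sketched with made-up inputs, show ranges, a complemented set, and metacharacters used literally inside a set. `group()` returns the first character that the set matched.

```python
import re

print(re.search('[a-z]', 'ABC') is not None)    # False: no lowercase letter
print(re.search('[a-z]', 'ABc') is not None)    # True
print(re.search('[0-9]', 'room 42').group())    # 4
print(re.search('[^0-9 ]', '42 is x').group())  # i
print(re.search('[(+*)]', '1+2').group())       # +
```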

Parentheses `()` work much as they do in mathematics: they group the expressions inside them, and a quantifier placed after the group repeats the whole group. 

For example, the pattern `(abc){2,3}` matches `abc` repeated 2 or 3 times.

In [None]:
print(re.search('(abc){2,3}', 'abc')  != None)         # does not match
print(re.search('(abc){2,3}', 'abcabc')  != None)      # match
print(re.search('(abc){2,3}', 'abcabcabc')  != None)   # match

print(re.search('(Vivian, ){2,}', 'Vivian, Vivian, Jason, ')  != None)   # match
print(re.search('(Vivian, ){2,}', 'Vivian, Jason, Vivian, ')  != None)   # does not match

<p><a name="vertical"></a></p>
#### vertical bar
[[back to top]](#sections)

`|` is the logical OR operator. For example, `a|b` matches `a` or `b`, which is equivalent to `[ab]`. 
`abc|123` matches the sequence `abc` or the sequence `123`, while `[abc123]` matches any single character among `a, b, c, 1, 2, 3`. 

In [None]:
print(re.search('abc|123', 'a') != None)   # does not match
print(re.search('abc|123', '1') != None)   # does not match
print(re.search('abc|123', '123') != None) # match
print(re.search('abc|123', 'abc') != None) # match
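Because `|` separates everything to its left from everything to its right, parentheses are often needed to limit its scope. A minimal sketch, reusing the gray/grey example from the start of this section:

```python
import re

# Without parentheses the alternatives are 'gra' and 'ey'
print(re.search('gra|ey', 'grey').group())    # ey

# With parentheses only the middle letter varies
print(re.search('gr(a|e)y', 'grey').group())  # grey
print(re.search('gr(a|e)y', 'gray').group())  # gray

# For single characters a set does the same job
print(re.search('gr[ae]y', 'gray').group())   # gray
```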

<p><a name="backslash"></a></p>
#### backslash
[[back to top]](#sections)

To match a literal `?`, escape it with a backslash: `\?`. Otherwise `?` is treated as a metacharacter that matches the preceding character or group zero times or once. A pattern consisting of `?` alone has nothing to repeat, so it raises an error, as the Python 2.7 traceback below shows.

In [None]:
print(re.search(r'\?', 'Hi, how are you today?') != None)  # raw string keeps the backslash intact

```python
>>> print re.search('?', 'Hi, how are you today?') != None
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/re.py", line 142, in search
    return _compile(pattern, flags).search(string)
  File "/usr/lib/python2.7/re.py", line 244, in _compile
    raise error, v # invalid expression
sre_constants.error: nothing to repeat
```
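The traceback above comes from Python 2.7. In Python 3 the same mistake raises `re.error`, which can be caught like any other exception. A minimal sketch:

```python
import re

try:
    re.search('?', 'Hi, how are you today?')
except re.error as err:
    # The pattern fails to compile because '?' has nothing to repeat
    print('invalid pattern:', err)

# The escaped version compiles and finds the question mark
print(re.search(r'\?', 'Hi, how are you today?').span())  # (21, 22)
```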

<p><a name="function"></a></p>

## Functions in Regular Expression


- **re.split(pattern, string)**: Split the `string` into a list by the `pattern`.
- **re.sub(pattern, replace, string)**: Replace the substrings in the `string` that match the `pattern` with the argument `replace`.
- **re.findall(pattern, string)**: Find all substrings where the `pattern` matches, and return them as a list.

Python strings already have similar built-in methods: `str.split` is similar to `re.split`, and `str.replace` is similar to `re.sub`.

However, `re.split` and `re.sub` are much more powerful because they accept patterns rather than fixed substrings.
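The difference shows up as soon as a text uses more than one separator. The sketch below compares the string methods with their `re` counterparts on a made-up sample line.

```python
import re

line = 'apple, banana;cherry  date'  # hypothetical sample text

# str.split accepts one fixed separator
print(line.split(', '))             # ['apple', 'banana;cherry  date']

# re.split accepts a pattern: one or more commas, semicolons, or spaces
print(re.split('[,; ]+', line))     # ['apple', 'banana', 'cherry', 'date']

# str.replace swaps one fixed substring
print(line.replace(';', ' '))       # apple, banana cherry  date

# re.sub swaps every match of a pattern
print(re.sub('[,; ]+', ' ', line))  # apple banana cherry date
```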

In [None]:
s = '''The re module was added in Python 1.5, 
and provides Perl-style regular expression patterns. 
Earlier versions of Python came with the regex module, 
which provided Emacs-style patterns. 
The regex module was removed completely in Python 2.5.'''

<p><a name="sub"></a></p>
### re.sub
[[back to top]](#sections)

We can replace all separators at the same time using a regular expression.

- **Question**

Suppose we want to split the text `s` into a list in which each element is a word. The separators are the period (`.`), dash (`-`), comma (`,`) and blank space (` `).

- **Solution**

1. Since `str.split` cannot split a string on several different separators at once, an alternative is to replace every separator with a blank space.

2. Then we can split the resulting text on the blank spaces.

In [None]:
s2 = s
s2 = re.sub('[\n,.-]', ' ', s2)
print(s2)
# consecutive separators leave runs of spaces, so split on one or more blank spaces
re.split(' +', s2)

<p><a name="split"></a></p>
### re.split
[[back to top]](#sections)


A simpler method uses regular expressions to directly split the text by multiple separators.

In [None]:
re.split('[\n ,.-]+', s)
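One detail to watch for: when the text ends with a separator, `re.split` returns an empty string as the last element. The sketch below uses a shortened, made-up sentence and filters the empty strings out.

```python
import re

text = 'Perl-style patterns, added in Python 1.5.'  # hypothetical sample
parts = re.split('[\n ,.-]+', text)
print(parts)  # ['Perl', 'style', 'patterns', 'added', 'in', 'Python', '1', '5', '']

# Drop the empty strings left by leading or trailing separators
words = [p for p in parts if p]
print(words)  # ['Perl', 'style', 'patterns', 'added', 'in', 'Python', '1', '5']
```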

<p><a name="findall"></a></p>
### re.findall
[[back to top]](#sections)

Similar to **re.split**, **re.findall** also works well in this case.

Just select letters in the string `s` by using **re.findall**.

In [None]:
re.findall('[a-zA-Z]+', s) # to include digits as well, use re.findall('[a-zA-Z0-9]+', s)

### Special sequences in regular expressions

****

Some backslash sequences have a special meaning in regular expressions.

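- `\d`:
Matches any decimal digit; for ASCII text this is equivalent to the class `[0-9]`.
- `\D`:
Matches any non-digit character; for ASCII text this is equivalent to the class `[^0-9]`.

- `\w`:
Matches any alphanumeric character or underscore; for ASCII text this is equivalent to the class `[a-zA-Z0-9_]`.
- `\W`:
Matches any character that `\w` does not match; for ASCII text this is equivalent to the class `[^a-zA-Z0-9_]`.

- `\s`:
Matches any whitespace character; for ASCII text this is equivalent to the class `[ \t\n\r\f\v]`.
- `\S`:
Matches any non-whitespace character; for ASCII text this is equivalent to the class `[^ \t\n\r\f\v]`.

In Python 3 these sequences are Unicode-aware by default for `str` patterns, so the classes listed above are exact only when the `re.ASCII` flag is used.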

So the simplest way to solve the problem is:

In [None]:
re.findall(r'\w+', s) # same as re.findall('[a-zA-Z0-9_]+', s) for ASCII text
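The special sequences can be combined with any of the functions above. A minimal sketch on a made-up sample string, written with raw strings so the backslashes reach the regex engine unchanged:

```python
import re

text = 'Order 66 shipped on 2015-06-01\tto room_7'  # hypothetical sample

print(re.findall(r'\d+', text))  # ['66', '2015', '06', '01', '7']
print(re.findall(r'\w+', text))  # ['Order', '66', 'shipped', 'on', '2015', '06', '01', 'to', 'room_7']
print(re.split(r'\s+', text))    # ['Order', '66', 'shipped', 'on', '2015-06-01', 'to', 'room_7']
print(re.sub(r'\D', '', text))   # 66201506017
```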

<p><a name="example"></a></p>
## Example: WordCount
[[back to top]](#sections)


- Now let's write a function `wordCount` that counts how often each word occurs in a string.
- You can use the convenient `Counter()` data type from the `collections` module. See the details [here](https://docs.python.org/3/library/collections.html#collections.Counter). 

In [None]:
from collections import Counter
import re

# Version 1: count with a plain dictionary
def wordCount(x, number=False):
    '''
    x: string to count
    number: whether to count the numbers
    '''
    ## convert to lowercase and find words
    x = x.lower()
    if number:
        word_list = re.findall(r'\w+', x)
    else:
        word_list = re.findall('[a-z]+', x)
    ## count and return
    result = {}
    for word in word_list:
        if word in result:
            result[word] += 1
        else:
            result[word] = 1
    return result


# Version 2: same counts using Counter (this definition replaces version 1)
def wordCount(x, number=False):
    '''
    x: string to count
    number: whether to count the numbers
    '''
    ## convert to lowercase and find words
    x = x.lower()
    if number:
        word_list = re.findall(r'\w+', x)
    else:
        word_list = re.findall('[a-z]+', x)

    ## convert the word_list to a Counter object
    result = Counter(word_list)

    return result
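A short usage sketch follows. The function is restated compactly so the block runs on its own, and the sample sentence is made up. `most_common` is a standard `Counter` method that returns the most frequent items first.

```python
from collections import Counter
import re

def wordCount(x, number=False):
    # Compact restatement of the Counter-based version above
    pattern = r'\w+' if number else '[a-z]+'
    return Counter(re.findall(pattern, x.lower()))

text = 'The cat and the hat. The end, in 2015.'  # hypothetical sample
counts = wordCount(text)

print(counts['the'])                           # 3
print(counts.most_common(1))                   # [('the', 3)]
print('2015' in counts)                        # False
print('2015' in wordCount(text, number=True))  # True
```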

<p><a name="email"></a></p>
## Example: Is it an email address?
[[back to top]](#sections)

- **Rule #1**
 - The first part of an email address contains at least one of the following: uppercase letters, lowercase letters, the numbers 0-9, periods (.), plus signs (+), hyphens (-), or underscores (_).
 - Regex: **`^[a-zA-Z0-9_.+-]+`** 
 - By putting all these possible sequences and symbols in brackets (as opposed to parentheses) we are saying “this symbol can be any one of these things we’ve listed in the brackets.”
  
  
- **Rule #2**
 - After this, the email address contains the @ symbol.
 - Regex: **`@`**

- **Rule #3**
 - The email address then must contain at least one uppercase or lowercase letter.
 - Regex: **`[A-Za-z]+`**
 

- **Rule #4**
 - This is followed by a period (.).
 - Regex: **`\.`**
 
 
- **Rule #5**
 - Finally, the email address ends with com, org, edu, or net (in reality there are many possible top-level domains, but these four suffice for the sake of example).
 - Regex: **`(com|org|edu|net)$`**
 - This lists the possible sequences of letters that can occur after the period in the second part of an email address.

- By concatenating all of the rules, we arrive at the regular expression:

`^[a-zA-Z0-9_.+-]+@[A-Za-z]+\.(com|org|edu|net)$`

- Let's wrap it up in a function.

In [None]:
import re
def isEmail(x):
    x = x.lower() # case insensitive
    # raw string so that \. reaches the regex engine unchanged
    emailPattern = r"^[a-zA-Z0-9_.+-]+@[A-Za-z]+\.(com|org|edu|net)$"
    result = re.search(emailPattern, x) is not None
    return result

In [None]:
emails = ['some&name9@gmail.com', 'some_name@yahoo.com', 'contact@supstat.com',
          'some.n@ame@an-email.com', 'some.name@an.email.com']
for i in emails:
    print('%25s is a valid e-mail address: %s' % (i, isEmail(i)))
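To see why each address passes or fails, the sketch below applies the same pattern directly and annotates every test address. The list `results` is a name introduced only for this illustration.

```python
import re

# Same pattern as above, written as a raw string
emailPattern = r'^[a-zA-Z0-9_.+-]+@[A-Za-z]+\.(com|org|edu|net)$'

emails = ['some&name9@gmail.com',     # '&' is not an allowed character
          'some_name@yahoo.com',      # fits all five rules
          'contact@supstat.com',      # fits all five rules
          'some.n@ame@an-email.com',  # contains two '@' symbols
          'some.name@an.email.com']   # more than one period after '@'

results = [re.search(emailPattern, e.lower()) is not None for e in emails]
print(results)  # [False, True, True, False, False]
```

The last address is a perfectly plausible real-world address, which shows how restrictive these five rules are.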

- For readers interested in how email validation is done in industry, see the approaches discussed in this [Stack Overflow answer](https://stackoverflow.com/a/201378).