<img src="https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png" style="float: left; margin: 15px;">
### Regular Expressions

**Week 7 | Lesson 4.1**

---
| TIMING | TYPE |
|:-:|---|
| 25 min | [Review: String Methods](#review) |
| 10 min | [History of RegEx and Applications](#hook) |
| 45 min | [Regular Expressions in Python](#content) |
| 20 min | [Conclusion](#conclusion) |
| 5 min | [Additional Resources](#more) |

---

### Lesson Objectives
*After this lesson, you will be able to:*
- Use Regular Expressions for Pattern Matching on Text Data 
- Compare and contrast `regex` with previous methods of searching and splitting on strings 

---
### Student Pre-Work 

*Before this lesson, you should already be able to:*
- Parse Strings using `split` and `strip` methods 


---
<a name="review"></a>
### Review: Methods on Strings 

> **Exercise:** What are the `string` constants?
     - string.ascii_letters
     - string.ascii_lowercase 
     - string.ascii_uppercase 
     - string.digits
    
    
> **Exercise:** What do the following deprecated `string` module functions do?
    - string.split()
    - string.strip()
    - string.replace()
    
    
> **Exercise:** For this exercise, solve the following problem on HackerRank and post your solution in the cell below: <a href="https://www.hackerrank.com/contests/pythonista-practice-session/challenges/validate-list-of-email-address-with-filter">Validating Email Address with a Filter</a>



---
<a name="hook"> </a>

### Applications: History of Regular Expressions 

A **regular expression** is a pattern, or sequence of characters, made up of ordinary characters and metacharacters. It serves as a **template** to be searched for or matched in a string.


### 1950s

<img src=https://brainmoda.files.wordpress.com/2013/01/history.png>

- First Speech Systems
- Foundational Work on Automata, Formal Languages, and Information Theory by Stephen Kleene 
- Little understanding of natural language syntax, semantics, and pragmatics 
---
### 1960s 

<img src=http://thenexttech.startupitalia.eu/wp-content/uploads/sites/10/2016/11/chatbot-eliza.png> 
- Incorporated into `ed` text editor by Ken Thompson
- ELIZA (MIT): a "psychotherapist" chatbot that faked natural language understanding with simple pattern matching

---

### 1970s
<img src=http://so.mrozekma.com/unix-grep.png>

- `g/re/p` (global regular expression print): the `grep` utility brought regex to the Unix command line in the 1970s
- Could interpret questions, statements, and commands


---

### 1980s-1990s
<img src=http://perl-begin.org/IDEs-and-tools/shots/padre/padre-256.png>
- More complicated regexes in Perl 
- Perl-style regexes spread into many modern tools, including PHP and the Apache HTTP Server

---
### Modern uses for RegEx...
<img src=http://i.imgur.com/W72QsHu.png> 

- Web Search

- Word Processing, find, substitute 

- Validate fields in a database (dates, email addr, URLs)

- Searching a corpus for linguistic patterns and gathering stats

- Database String Matching

---
<a name="content"></a>
### **Regular Expressions in Python**

Regular expressions in Python are used through the `re` module. The basic syntax is:

```
import re 
match = re.search(pat, str)
```

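The **`re.search()`** method takes a regex pattern and scans **the entire** string for the first place where the pattern matches. It returns a match object if the pattern is found and `None` otherwise.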

    ## Example: use of search
    import re
    sample = 'an example word:catt!!'

    ## match stores the search result: a match object if the pattern
    ## is found, or None (which is falsy) if it is not
    match = re.search(r'word:cat', sample)
    # the r prefix makes this a raw string, so backslashes are not treated as escapes
    ## exercise: what's the difference between '\n' and r'\n'?

    if match:
        print('found', match.group())  ## found word:cat
    else:
        print('did not find')

Output:

    found word:cat


> **Exercise:** What are all the methods associated with `match` and what do they do?
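
As a starting point for this exercise, here is a minimal sketch of a few commonly used match-object methods, reusing the `sample` string from the cell above. It is not an exhaustive list, and the variable names are chosen only for illustration.

```python
import re

sample = 'an example word:catt!!'
match = re.search(r'word:cat', sample)

matched_text = match.group()   # the text that matched
start_index = match.start()    # index where the match begins
end_index = match.end()        # index just past the end of the match
start_end = match.span()       # (start, end) as a tuple

print(matched_text)                        # word:cat
print(start_index, end_index, start_end)   # 11 19 (11, 19)
```

Calling `dir(match)` in the notebook lists the remaining attributes to explore.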

### Basic Patterns for Regular Expressions 

- `Ordinary Characters`, e.g. `a`, `X`, `9`, can be matched exactly. 
    - **Exercise:** What is the output of:
        - `bool(re.search(r'9', '332'))`
        
        - `bool(re.search(r'33', '332'))`


- `.`: A period matches any single character except the newline `\n`.
    - **Exercise:** What is the output of:
        - `bool(re.search(r'.', 'Love thy neighbor'))`
        - `bool(re.search(r'.', '233=='))`


- `\w`: matches a "word" character: a letter, digit, or underscore, `[a-zA-Z0-9_]`. Note the lowercase `w`.

     - **Exercise:** What is the output of:
        - `bool(re.search(r'\w', 'great'))`
        - `bool(re.search(r'\w', '233=='))`
        - `bool(re.search(r'\w', '=='))`



- `\W`: matches any "non-word" character
     - **Exercise:** What is the output of:
         - `bool(re.search(r'\W', 'great'))`
         - `bool(re.search(r'\W', '233=='))`
         - `bool(re.search(r'\W', '=='))`


- `\d`: matches a decimal digit `[0-9]`
     - **Exercise:** What is the output of:
         - `bool(re.search(r'\d', 'great'))`
         - `bool(re.search(r'\d', '233=='))`
         - `bool(re.search(r'\d', '=='))`
         
         

- `\b`: boundary between word and non-word 
     - **Exercise:** What is the output of:
         - `bool(re.search(r'\bfoo\b', 'foo'))`
         - `bool(re.search(r'\bfoo\b', 'foo.'))`
         - `bool(re.search(r'\bfoo\b', 'foobar'))`
         - `bool(re.search(r'\bfoo\b', 'foo3'))`
         
         
- `\B`: matches the empty string, but only when it is not at the beginning or end of a word
    - **Exercise:** What is the output of:
        - `bool(re.search(r'py\B', 'py'))`
        - `bool(re.search(r'py\B', 'py.'))`
        - `bool(re.search(r'py\B', 'pybar'))`
        - `bool(re.search(r'py\B', 'py3'))`
        
        
- `\s`: matches a single whitespace character `[ \t\n\r\f\v]`

- `\S`: matches any non-whitespace character

- `\` inhibits the "specialness" of a character. So, for example, use `\.` to match a period or `\\` to match a backslash. If you are unsure whether a punctuation character such as `@` has special meaning, you can put a backslash in front of it, `\@`, to make sure it is treated as a literal character.
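
To illustrate the escaping rule, the sketch below compares an unescaped period with an escaped one. The sample strings are made up for illustration.

```python
import re

# Unescaped, '.' matches any character, so 'a.c' also matches 'abc'
loose = bool(re.search(r'a.c', 'abc'))      # True

# Escaped, '\.' matches only a literal period
strict = bool(re.search(r'a\.c', 'abc'))    # False
literal = bool(re.search(r'a\.c', 'a.c'))   # True

# '\\' in a raw-string pattern matches one literal backslash
backslash = bool(re.search(r'\\', r'C:\temp'))  # True

print(loose, strict, literal, backslash)  # True False True True
```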


Uncomment the lines below to check your answers to the exercises above:

    # print('Ordinary Matching:')
    # ## ordinary character matching
    # print(bool(re.search(r'9', '332')))
    # print(bool(re.search(r'33', '332')))

    # print()
    # print('. examples:')
    # ## output of character level matching
    # print(bool(re.search(r'.', 'Love thy neighbor')))
    # print(bool(re.search(r'.', '233==')))

    # print()
    # print(r'\w examples:')
    # ## output of word characters
    # print(bool(re.search(r'\w', 'great')))
    # print(bool(re.search(r'\w', '233==')))
    # print(bool(re.search(r'\w', '==')))

    # print()
    # print(r'\d examples:')
    # print(bool(re.search(r'\d', 'great')))
    # print(bool(re.search(r'\d', '233==')))
    # print(bool(re.search(r'\d', '==')))

    # print()
    # print(r'\W examples:')
    # ## output of non-word characters
    # print(bool(re.search(r'\W', 'great')))
    # print(bool(re.search(r'\W', '233==')))
    # print(bool(re.search(r'\W', '==')))

    # print()
    # print(r'\b examples:')
    # ## boundary at beginning and end
    # print(bool(re.search(r'\bfoo\b', 'foo')))
    # print(bool(re.search(r'\bfoo\b', 'foo.')))
    # print(bool(re.search(r'\bfoo\b', 'foobar')))
    # print(bool(re.search(r'\bfoo\b', 'foo3')))

    # print()
    # print(r'\B examples:')
    # ## boundary not at the beginning or end
    # print(bool(re.search(r'py\B', 'py')))
    # print(bool(re.search(r'py\B', 'py.')))
    # print(bool(re.search(r'py\B', 'pybar')))
    # print(bool(re.search(r'py\B', 'py3')))


### Quantifiers 

- `X`<font color='red'>*</font>: 0 or more repetitions of `X` 
    ```
     match = re.search(r'pi*', 'piiig')
     match = re.search(r'z*', 'piigiiii')
     match = re.search(r'\d\s*\d\s*\d', 'xx1 2   3xx') 
     match = re.search(r'\d\s*\d\s*\d', 'xx12  3xx') 
     match = re.search(r'\d\s*\d\s*\d', 'xx123xx') 
    
    ```
    
- `X`<font color='red'>+</font>: 1 or more repetitions of `X` 
    
    ```
     match = re.search(r'pi+', 'piiig')
     match = re.search(r'i+', 'piigiiii')
     match = re.search(r'z+', 'piigiiii')

    ```

- `X`<font color='red'>?</font>: 0 or 1 instances of `X` 

    ```
    match = re.search('cats?', 'cat')
    match = re.search('cats?', 'catss')
    ```
   
  
- `X`<font color='red'>{m}</font>: exactly m instances of `X` 
    
    ```
    match = re.search('s{3}', 'cat')
    match = re.search('s{2}', 'catss')
    ```
- `X`<font color='red'>{m,}</font>: at least `m` instances of `X` 
      
    ```
    match = re.search('s{3,}', 'cat')
    match = re.search('s{2,}', 'catsssss')
    ```
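- `X`<font color='red'>{m,n}</font>: between `m` and `n` instances of `X`, inclusive (write it without a space after the comma, otherwise Python treats the braces as literal text)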
    
    ```
    match = re.search('s{2,3}', 'cat')
    match = re.search('s{4,6}', 'catsssss')
    ```
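
The sketch below runs a selection of the quantifier examples from the list above and shows what each one matches. The helper `show` is a hypothetical convenience function, not part of `re`; it returns the matched text, or `None` when there is no match.

```python
import re

def show(pattern, text):
    """Hypothetical helper: return the matched text, or None if no match."""
    match = re.search(pattern, text)
    return match.group() if match else None

# *: zero or more (greedy, so it takes as many as it can)
print(show(r'pi*', 'piiig'))                  # piii
print(repr(show(r'z*', 'piigiiii')))          # '' (zero z's match at position 0)
print(show(r'\d\s*\d\s*\d', 'xx1 2   3xx'))   # 1 2   3

# +: one or more (search returns the leftmost match)
print(show(r'i+', 'piigiiii'))                # ii
print(show(r'z+', 'piigiiii'))                # None

# ?: zero or one
print(show(r'cats?', 'catss'))                # cats

# {m}, {m,}, {m,n}: counted repetitions
print(show(r's{2}', 'catss'))                 # ss
print(show(r's{3,}', 'cat'))                  # None
print(show(r's{4,6}', 'catsssss'))            # sssss
```

Note that `z*` succeeds with an empty match, while `z+` fails outright. This is a common source of surprises when `*` is used where `+` was intended.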

> **Exercise**: Write a regular expression to extract an email address from inside of the string, e.g.  `'xyz alice-b@google.com'`.

### Character Classes 
Character classes **`[...]` match any single character in the class.** For example:

>**`[aeiou]`** matches vowels. 


Use **`^` to specify the complement set**:
> **`[^aeiou]`** matches non-vowels. 


Use `-` to specify the range of letters:
> **`[a-e]`** matches any one of `abcde`.

> **`[0-9a-f]`** matches any one of `0123456789abcdef`.
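
A minimal sketch of the three forms above (a set, a complement set, and ranges), using `re.findall` to list every match. The sample strings are invented for illustration.

```python
import re

# A set and its complement
vowels = re.findall(r'[aeiou]', 'education')
non_vowels = re.findall(r'[^aeiou]', 'education')
print(vowels)       # ['e', 'u', 'a', 'i', 'o']
print(non_vowels)   # ['d', 'c', 't', 'n']

# Ranges are case sensitive: [0-9a-f] does not match 'B', 'E', or 'F'
text = 'colors 0xBEEF and 0x1f'
lower_hex = re.findall(r'0x[0-9a-f]+', text)
any_hex = re.findall(r'0x[0-9a-fA-F]+', text)
print(lower_hex)    # ['0x1f']
print(any_hex)      # ['0xBEEF', '0x1f']
```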

> **Exercise**: Write a regular expression to extract an email address from inside of the string using **character classes**, e.g.  `'xyz alice-b@google.com'`.

### Group Extraction

**()**: The "group" feature of a regex allows you to pick out parts of the matching text. The match object's `group()` method returns one or more subgroups of the match.

```
>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
>>> m.group(0)       # The entire match
'Isaac Newton'
>>> m.group(1)       # The first parenthesized subgroup.
'Isaac'
>>> m.group(2)       # The second parenthesized subgroup.
'Newton'
>>> m.group(1, 2)    # Multiple arguments give us a tuple.
('Isaac', 'Newton')

```

<center>
**"A common workflow with regular expressions is that you write a pattern for the thing you are looking for, adding parenthesis groups to extract the parts you want." - Google for Education**
</center>



> **Exercise:** Suppose for the emails problem that we want to extract the username and host separately. Rewrite your regular expression using groups to do so.

### Matching on Start and Finish 

**`^`** : Matches the start of the string and in multiline mode matches immediately after each newline. 

```
match = re.search(r'^\d', '123')
match = re.search(r'^\d', 'bool')
```

`$`: Matches the end of the string or just before the newline at the end of the string, and in multiline mode also matches before each newline.

    match = re.search('foo$', 'foo')
    
    match = re.search('foo$', 'foobar')
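
The sketch below evaluates the `^` and `$` examples above so the results are visible. The variable names and the extra `'foo\n'` case are added for illustration.

```python
import re

starts_with_digit = re.search(r'^\d', '123')     # matches '1'
no_leading_digit = re.search(r'^\d', 'bool')     # None

ends_with_foo = re.search(r'foo$', 'foo')        # matches 'foo'
foo_not_at_end = re.search(r'foo$', 'foobar')    # None

# $ also matches just before a single trailing newline
before_newline = re.search(r'foo$', 'foo\n')     # matches 'foo'

print(starts_with_digit.group())   # 1
print(no_leading_digit)            # None
print(ends_with_foo.group())       # foo
print(foo_not_at_end)              # None
print(before_newline.group())      # foo
```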


### Backreferences

`\number`: Matches the contents of the group of the same number. 

    match = re.search(r'(.+) \1', 'the the')
    match = re.search(r'(.+) \1', 'the back the')
    match = re.search(r'(.+) \1', 'thethe')
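
To make the three results above concrete, here is a short sketch. The last example, a doubled-word finder, is an added illustration and not part of the original cell.

```python
import re

repeated = re.search(r'(.+) \1', 'the the')
print(repeated.group())    # the the
print(repeated.group(1))   # the

# No text here is followed by a space and then an exact copy of itself
print(re.search(r'(.+) \1', 'the back the'))   # None
# No space between the two copies, so the pattern cannot match
print(re.search(r'(.+) \1', 'thethe'))         # None

# Finding accidentally doubled words
doubled = re.findall(r'\b(\w+) \1\b', 'it is is what it it is')
print(doubled)             # ['is', 'it']
```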
    
    
### Disjunctions 
**`|`**: `A|B` where A and B can be arbitrary REs, creates a regular expression that will match either A or B.
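
A small sketch of a disjunction in use. The strings are invented for illustration, and `(?:...)` is a non-capturing group, used here only to limit the scope of `|` to part of the pattern.

```python
import re

# Either alternative may match
pet = re.search(r'cat|dog', 'hot dog').group()
print(pet)         # dog

# Without the parentheses, 'gra|ey' would mean 'gra' OR 'ey'
spellings = re.findall(r'gr(?:a|e)y', 'gray grey griy')
print(spellings)   # ['gray', 'grey']
```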

> **Exercise**: What strings are matched by the following regular expression patterns: 
    - r'ab+a'
    - r'(ab)+'
    - r'([^aeiou][aeiou])\1'
    - r'\bdis\w+\b'

### Python `re` Functions 

`re.match` vs. `re.search`: `re.match` only looks for a match at the beginning of the string, while `re.search` scans the whole string.

`re.compile`: compile a regular expression which can be used for searching or matching. 
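
The sketch below contrasts `re.match` with `re.search` and then reuses a compiled pattern. The sample text and the pattern `[a-z]at` are made up for illustration.

```python
import re

text = 'a cat sat'

at_start = re.match(r'cat', text)     # None: 'cat' is not at the start
anywhere = re.search(r'cat', text)    # match object for 'cat'

# Compile once, then call the same methods on the pattern object
word_at = re.compile(r'[a-z]at')
first = word_at.search(text).group()
all_hits = word_at.findall(text)
from_start = word_at.match(text)

print(at_start, anywhere.group())     # None cat
print(first, all_hits, from_start)    # cat ['cat', 'sat'] None
```

Compiling is mostly worthwhile when the same pattern is applied many times, for example inside a loop over lines of a file.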

> **Exercise**: Let's check whether the strings you proposed in the exercise above are actually matched.
    - first_regex = re.compile(r'ab+a')
    - second_regex = re.compile(r'(ab)+')
    - third_regex = re.compile(r'([^aeiou][aeiou])\1')
    - fourth_regex = re.compile(r'\bdis\w+\b')
    
    
`re.findall`: Finds all the matches and returns them as a list of strings. 

    text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
    emails = re.findall(r'[\w\.-]+@[\w\.-]+', text)
    for email in emails:
        # do something with each found email string
        print(email)
        
You can also use `findall` with files:

    # Open the file; the with block closes it automatically
    with open('test.txt', 'r') as f:
        # Feed the file text into findall(); it returns a list of all the found strings
        strings = re.findall(r'some pattern', f.read())
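
For a version that runs as-is, the sketch below writes a small temporary file first and then searches it. The file contents and the use of `tempfile` are assumptions made for the example; in practice you would open your own file.

```python
import os
import re
import tempfile

# Create a throwaway file so the example is self-contained
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('alice@google.com wrote to bob@abc.com\n')
    path = tmp.name

# Read the whole file and collect every email-like string
with open(path, 'r') as f:
    emails = re.findall(r'[\w.-]+@[\w.-]+', f.read())

os.remove(path)  # clean up the temporary file
print(emails)    # ['alice@google.com', 'bob@abc.com']
```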


---
`findall` and Groups

The parenthesis `()` group mechanism can be combined with `findall()`. If the pattern includes 2 or more parenthesis groups, then instead of returning a list of strings, `findall()` returns a list of *tuples*. Each tuple represents one match of the pattern, and inside the tuple is the `group(1)`, `group(2)`, ... data. So if 2 parenthesis groups are added to the email pattern, then `findall()` returns a list of tuples, each of length 2, containing the username and host, e.g. `('alice', 'google.com')`.

    text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

    pairs = re.findall(r'([\w\.-]+)@([\w\.-]+)', text)
    print(pairs)  ## [('alice', 'google.com'), ('bob', 'abc.com')]
    for username, host in pairs:
        print(username)  ## username
        print(host)      ## host
        
--- 
`re.sub`: The `re.sub(pat, replacement, str)` function searches for all instances of the pattern in the given string and replaces them. The replacement string can include `'\1'`, `'\2'`, which refer to the text from `group(1)`, `group(2)`, and so on from the original matching text.

    text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
    ## re.sub(pat, replacement, str) -- returns new string with all replacements,
    ## \1 is group(1), \2 group(2) in the replacement
    print(re.sub(r'([\w\.-]+)@([\w\.-]+)', r'\1@yo-yo-dyne.com', text))
    ## purple alice@yo-yo-dyne.com, blah monkey bob@yo-yo-dyne.com blah dishwasher

### Flags in `re`

In the Python regular expression methods above, you will notice that each of them also takes an optional `flags` argument. Most of the available flags are a convenience and can be written directly into the regular expression itself, but some can be useful in certain cases.

- `re.IGNORECASE` makes the pattern case insensitive so that it matches strings of different capitalizations
- `re.MULTILINE` is useful when your input string contains newline characters (`\n`): it allows the start and end metacharacters (`^` and `$` respectively) to match at the beginning and end of each line instead of only at the beginning and end of the whole input string
- `re.DOTALL` allows the dot (`.`) metacharacter to match all characters, including the newline character (`\n`).
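
A brief sketch of the three flags in action, plus the inline form `(?i)`, which is one way of writing a flag into the pattern itself. The sample strings are invented for illustration.

```python
import re

# IGNORECASE: capitalization no longer matters
ignore_case = re.search(r'python', 'I love PYTHON', re.IGNORECASE).group()

# MULTILINE: ^ and $ apply to every line, not just the whole string
lines = 'alpha\nbeta'
whole_string_only = re.findall(r'^\w+$', lines)
per_line = re.findall(r'^\w+$', lines, re.MULTILINE)

# DOTALL: '.' is allowed to match the newline
dot_default = re.search(r'a.b', 'a\nb')
dot_all = re.search(r'a.b', 'a\nb', re.DOTALL).group()

# Inline flag, equivalent to passing re.IGNORECASE
inline = re.search(r'(?i)python', 'I love PYTHON').group()

print(ignore_case, inline)           # PYTHON PYTHON
print(whole_string_only, per_line)   # [] ['alpha', 'beta']
print(dot_default, repr(dot_all))    # None 'a\nb'
```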

### Lab: Regular Expression 

<a href=https://www.hackerrank.com/regex-lab-ii> Regular Expression Lab </a>

---
### Conclusion
<a name="conclusion"></a>

- The history of `regex` is closely linked with the history of theoretical computer science and artificial intelligence.

- Learned about standard pattern matching using regex.

- Learned about the functions in the `re` module.

---
<a name="more"></a>
### Additional Resources  

<a href=https://www.princeton.edu/~mlovett/reference/Regular-Expressions.pdf> RegEx: Deep Dive Princeton </a>

<a href=http://web.mit.edu/hackl/www/lab/turkshop/slides/regex-cheatsheet.pdf> RegEx: Cheat Sheet </a>

<a href=http://www.rexegg.com/regex-style.html> Regular Expression Style Guide </a>