# Lab 6: Regular Expressions

## Course Policies

**Collaboration Policy**

Data science is a collaborative activity. While you may talk with others about
the lab, we ask that you **write your solutions individually**. If you do
discuss the assignment with others, please **include their names** at the top
of your solution.


## Due Date

This assignment is due at **11:59pm Sunday, October 13th**.

# Collaborators

Write names in this cell:

In [1]:
import pandas as pd
import numpy as np
import re
import sqlalchemy

from dsua_112_utils import fetch_and_cache
from pathlib import Path

## Objectives for Lab 6:

This lab has two main parts. First, you will practice the basic usage of regular expressions; second, you will learn to use the `re` module in Python. Some of the materials are based on the tutorial at http://opim.wharton.upenn.edu/~sok/idtresources/python/regex.pdf. As you work through the first part of the lab, you may also find the website http://regex101.com helpful.


---
# Regular Expressions

We'll start with the simplest possible regular expressions. Since regular expressions
operate on strings, the most common task comes first: matching characters.

Most letters and characters will simply match themselves. For example, the regular expression `r'test'` will match the string `test` exactly. There are exceptions to this rule; some characters are special, and don't match themselves.
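As a quick illustration of literal matching, the sketch below uses `re.search` on a few made-up strings (the sample strings are assumptions chosen for the example, not part of the lab data). It also shows one special character, the dot, behaving differently from an ordinary letter.

```python
import re

# A pattern made only of ordinary characters matches that exact sequence.
literal_hit = re.search(r"test", "a test string")
literal_miss = re.search(r"test", "a quiz string")

print(literal_hit.group())  # test
print(literal_miss)         # None

# The dot is special: it stands for any single character except a newline.
dot_hit = re.search(r"t.st", "a tast string")
print(dot_hit.group())      # tast
```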

Here is a list of metacharacters that are widely used in regular expressions.

<table border="1" class="dataframe" >
<thead>
  <tr style="text-align: right;">
    <th>Pattern </th>
    <th>Description</th> 
  </tr>
 </thead>
 <tbody>
  <tr>
    <td>^</td>
    <td>Matches the beginning of the string (or of each line when <code>re.MULTILINE</code> is set).</td> 
  </tr>
  <tr>
    <td>$</td>
    <td>Matches the end of the string (or of each line when <code>re.MULTILINE</code> is set).</td> 
  </tr>
  <tr>
    <td>.</td>
    <td>Matches any single character except newline. </td> 
  </tr>
  <tr>
    <td>*</td>
    <td>Matches 0 or more occurrences of preceding expression.</td>
  </tr>
  <tr>
    <td>+</td>
    <td>Matches 1 or more occurrences of preceding expression.</td>
  </tr>
  <tr>
    <td>?</td>
    <td>Matches 0 or 1 occurrence of preceding expression.</td>
  </tr>
  <tr>
    <td>[...]</td>
    <td>Matches any single character in brackets.</td>
  </tr>
  <tr>
    <td>[^...]</td>
    <td>Matches any single character <b>not</b> in brackets.</td>
  </tr>
  <tr>
    <td>{n}</td>
    <td>Matches exactly n occurrences of preceding expression.</td>
  </tr>
  <tr>
    <td>{n,}</td>
    <td>Matches n or more occurrences of preceding expression.</td>
  </tr>
  <tr>
    <td>{n,m}</td>
    <td>Matches at least n and at most m occurrences of preceding expression.</td>
  </tr>
  <tr>
    <td>a|b</td>
    <td>Matches either a or b.</td>
  </tr>
  <tr>
    <td>\1...\9</td>
    <td>Matches the same text as the n-th grouped subexpression (a backreference).</td>
  </tr>
  </tbody>
</table>
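The short sketch below exercises most rows of the table with `re.findall` and `re.search`. All sample strings are hypothetical and were chosen only to make each metacharacter's effect visible.

```python
import re

# Quantifiers: *, +, ? and {n,m} apply to the preceding expression (here, "b").
star = re.findall(r"ab*", "a ab abbb")            # ['a', 'ab', 'abbb']
plus = re.findall(r"ab+", "a ab abbb")            # ['ab', 'abbb']
optional = re.findall(r"ab?", "a ab abbb")        # ['a', 'ab', 'ab']
bounded = re.findall(r"ab{2,3}", "ab abb abbbb")  # ['abb', 'abbb']

# Character classes: [...] lists allowed characters, [^...] lists excluded ones.
in_class = re.findall(r"[cb]at", "cat bat rat")       # ['cat', 'bat']
not_in_class = re.findall(r"[^cb]at", "cat bat rat")  # ['rat']

# Alternation: either side of | may match.
either = re.findall(r"cat|rat", "cat bat rat")        # ['cat', 'rat']

# Backreference: \1 must repeat whatever group 1 captured.
doubled = re.search(r"([a-z])\1", "letter").group()   # 'tt'

# Anchors: by default ^ matches only at the start of the string;
# with re.MULTILINE it also matches at the start of each line.
lines = "cat\ncot\ncoat"
start_default = re.findall(r"^c[a-z]+", lines)                   # ['cat']
start_multi = re.findall(r"^c[a-z]+", lines, flags=re.MULTILINE)  # ['cat', 'cot', 'coat']

print(star, plus, optional, bounded)
print(in_class, not_in_class, either, doubled)
print(start_default, start_multi)
```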


Perhaps the most important metacharacter is the backslash, ‘\’. As in Python string literals, the backslash
can be followed by various characters to signal various special sequences. It’s also used to escape all the
metacharacters so you can still match them in patterns; for example, if you need to match a `[` or `\`, you
can precede them with a backslash to remove their special meaning:  `\[` or `\\`. 
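To make the escaping rule concrete, here is a minimal sketch using made-up strings. It writes patterns as raw strings (`r"..."`) so that Python itself does not interpret the backslashes, and it uses `re.escape`, a standard-library helper that escapes metacharacters for you.

```python
import re

price_text = "cost [USD]: 3.50"  # hypothetical sample string

# Unescaped, the dot matches any character, so "3x50" also matches.
unescaped_dot = re.search(r"3.50", "3x50")
# Escaped, the dot matches only a literal "."
escaped_dot = re.search(r"3\.50", "3x50")

# Square brackets must be escaped to be matched literally.
brackets = re.search(r"\[USD\]", price_text).group()

# A literal backslash is written as \\ inside the pattern.
backslash = re.search(r"\\", r"C:\temp").group()

# re.escape builds the escaped pattern from a plain string.
auto_escaped = re.search(re.escape("[USD]"), price_text).group()

print(unescaped_dot is not None, escaped_dot is None)
print(brackets, backslash, auto_escaped)
```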

The following predefined special sequences are available:

<table border="1" class="dataframe" >
<thead>
  <tr style="text-align: right;">
    <th>Pattern </th>
    <th>Description</th> 
  </tr>
 </thead>
 <tbody>
  <tr>
    <td>\d</td>
    <td>Matches any decimal digit; for ASCII text, this is equivalent to the class <code>[0-9]</code>.</td> 
  </tr>
  <tr>
    <td>\D</td>
    <td>Matches any non-digit character; for ASCII text, this is equivalent to the class <code>[^0-9]</code>.</td> 
  </tr>
  <tr>
    <td>\s</td>
    <td>Matches any whitespace character; for ASCII text, this is equivalent to the class <code>[ \t\n\r\f\v]</code>.</td> 
  </tr>
  <tr>
    <td>\S</td>
    <td>Matches any non-whitespace character; for ASCII text, this is equivalent to the class <code>[^ \t\n\r\f\v]</code>.</td>
  </tr>
  <tr>
    <td>\w</td>
    <td>Matches any alphanumeric character or underscore; for ASCII text, this is equivalent to the class <code>[a-zA-Z0-9_]</code>.</td>
  </tr>
  <tr>
    <td>\W</td>
    <td>Matches any character that is not alphanumeric or an underscore; for ASCII text, this is equivalent to the class <code>[^a-zA-Z0-9_]</code>.</td>
  </tr>
  </tbody>
</table>
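The special sequences can be tried out on a small string. The sample below is hypothetical and contains digits, letters, a comma, spaces, and a tab so that each sequence has something to match.

```python
import re

sample = "Room 101,\tFloor 3"  # made-up string with a tab after the comma

digits = re.findall(r"\d+", sample)       # runs of digits
words = re.findall(r"\w+", sample)        # runs of letters, digits, underscores
whitespace = re.findall(r"\s", sample)    # individual whitespace characters
cleaned = re.sub(r"\W+", "_", sample)     # collapse everything else into "_"

print(digits)      # ['101', '3']
print(words)       # ['Room', '101', 'Floor', '3']
print(whitespace)  # [' ', '\t', ' ']
print(cleaned)     # Room_101_Floor_3
```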

# Question 1
In this question, write patterns that match the given sequences. It may be as simple as the common letters on each line.

---
## Question 1a

Write a single regular expression to match the following strings without using the `|` operator.

1. **Match:** `abcdefg`
1. **Match:** `abcde`
1. **Match:** `abc`
1. **Skip:** `c abc`

```
BEGIN QUESTION
name: q1a
```

In [52]:
# BEGIN SOLUTION
regx1 = r"^abcd?e?f?g?" # fill in your pattern
# END SOLUTION

In [53]:
# TEST
"|" not in regx1
# EXPECTED: True

True

In [54]:
# TEST
re.search(regx1, "abc").group()
# EXPECTED: 'abc'

'abc'

In [55]:
# TEST
re.search(regx1, "abcde").group()
# EXPECTED: 'abcde'

'abcde'

In [56]:
# TEST
re.search(regx1, "abcdefg").group()
# EXPECTED: 'abcdefg'

'abcdefg'

In [57]:
# TEST
re.search(regx1, "c abc") is None
# EXPECTED: True

True

---
## Question 1b

Write a single regular expression to match the following strings without using the `|` operator.

1. **Match:** `can`
1. **Match:** `man`
1. **Match:** `fan`
1. **Skip:** `dan`
1. **Skip:** `ran`
1. **Skip:** `pan`

```
BEGIN QUESTION
name: q1b
```

In [58]:
# BEGIN SOLUTION
regx2 = r"[cmf]an" # fill in your pattern
# END SOLUTION

In [59]:
# TEST
"|" not in regx2
# EXPECTED: True

True

In [60]:
# TEST
re.match(regx2, 'can').group()
# EXPECTED: 'can'

'can'

In [61]:
# TEST
re.match(regx2, 'fan').group()
# EXPECTED: 'fan'

'fan'

In [62]:
# TEST
re.match(regx2, 'man').group()
# EXPECTED: 'man'

'man'

In [63]:
# TEST
re.match(regx2, 'dan') is None
# EXPECTED: True

True

In [64]:
# TEST
re.match(regx2, 'ran') is None
# EXPECTED: True

True

In [65]:
# TEST
re.match(regx2, 'pan') is None
# EXPECTED: True

True

# Question 2

Now that we have written a few regular expressions, we are ready to move beyond matching. In this question, we'll take a look at some functions from the `re` module.
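As a rough orientation, the sketch below runs one pattern through four commonly used `re` functions. The sentence and pattern are made up for illustration.

```python
import re

sentence = "lab 6 is due on day 13"  # hypothetical sample
pattern = r"d[a-z]+"                 # a "d" followed by lowercase letters

# re.match only looks at the very start of the string.
at_start = re.match(pattern, sentence)

# re.search scans forward and returns the first match anywhere.
first = re.search(pattern, sentence).group()

# re.findall returns every non-overlapping match as a list of strings.
every = re.findall(pattern, sentence)

# re.sub replaces each match with the given replacement text.
replaced = re.sub(pattern, "X", sentence)

print(at_start)  # None
print(first)     # due
print(every)     # ['due', 'day']
print(replaced)  # lab 6 is X on X 13
```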

---
## Question 2a:

Write a Python program to extract and print the numbers from a given string.

1. **Hint:** use `re.findall`
2. **Hint:** use `\d` for digits and either `*` or `+`.
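The choice between `*` and `+` matters with `re.findall`, because `*` is allowed to match zero characters. The following sketch, on a made-up string, shows the difference.

```python
import re

mixed = "a1b22"  # hypothetical sample

# \d* can match the empty string, so findall also reports empty matches
# at positions where no digit is present.
star_matches = re.findall(r"\d*", mixed)

# \d+ requires at least one digit, so only real numbers are returned.
plus_matches = re.findall(r"\d+", mixed)

print(star_matches)  # contains '1' and '22' plus some empty strings
print(plus_matches)  # ['1', '22']
```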

```
BEGIN QUESTION
name: q2a
```

In [66]:
text_q2a = "Ten 10, Twenty 20, Thirty 30"

# BEGIN SOLUTION
res_q2a = re.findall(r"\d+", text_q2a)  # raw string avoids the invalid-escape warning for \d
# END SOLUTION

res_q2a

['10', '20', '30']

In [67]:
# TEST
res_q2a
# EXPECTED: ['10', '20', '30']

['10', '20', '30']

---
## Question 2b:

Write a Python program to replace at most 2 occurrences of space, comma, or dot with a colon.

**Hint:** use `re.sub(regex, "newtext", string, count=number_of_occurrences)`
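To see how the `count` argument limits replacements, here is a small sketch on an unrelated, made-up string. Leaving `count` out (or passing 0) replaces every match.

```python
import re

date_text = "2019-10-13-final"  # hypothetical sample

# Replace only the first two hyphens, scanning left to right.
first_two = re.sub(r"-", "/", date_text, count=2)

# Without count, every hyphen is replaced.
all_hyphens = re.sub(r"-", "/", date_text)

print(first_two)    # 2019/10/13-final
print(all_hyphens)  # 2019/10/13/final
```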

```
BEGIN QUESTION
name: q2b
```

In [68]:
text_q2b = 'Python Exercises, PHP exercises.'

# BEGIN SOLUTION
res_q2b = re.sub(r'[,. ]', ':', text_q2b, count=2)  # count as a keyword; positional use is deprecated in newer Python
# END SOLUTION

res_q2b

'Python:Exercises: PHP exercises.'

In [69]:
# TEST
res_q2b
# EXPECTED: 'Python:Exercises: PHP exercises.'

'Python:Exercises: PHP exercises.'

---
## Question 2c: 

Write a Python program to extract the values between quotation marks in a string.

```
BEGIN QUESTION
name: q2c
```

In [71]:
text_q2c = '"Python", "PHP", "Java"'

# BEGIN SOLUTION
res_q2c = re.findall(r'"(.*?)"', text_q2c)
# END SOLUTION

res_q2c

['Python', 'PHP', 'Java']

In [72]:
# TEST
res_q2c 
# EXPECTED: ['Python', 'PHP', 'Java']

['Python', 'PHP', 'Java']
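Extracting text between delimiters usually depends on the difference between greedy and lazy quantifiers. The sketch below illustrates this on a made-up string with angle brackets instead of quotation marks; a negated character class is shown as an alternative way to get the same result.

```python
import re

tagged = "<b>bold</b> and <i>italic</i>"  # hypothetical sample

# Greedy: .* grabs as much as possible, so the match runs from the first "<"
# all the way to the last ">".
greedy = re.findall(r"<(.*)>", tagged)

# Lazy: .*? grabs as little as possible, stopping at the nearest ">".
lazy = re.findall(r"<(.*?)>", tagged)

# Alternative: forbid ">" inside the captured text.
negated = re.findall(r"<([^>]*)>", tagged)

print(greedy)   # ['b>bold</b> and <i>italic</i']
print(lazy)     # ['b', '/b', 'i', '/i']
print(negated)  # ['b', '/b', 'i', '/i']
```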

## Question 2d:

Write a regular expression to extract and print the quantity and type of objects in a string. You may assume that a space separates quantity and type, i.e., `"{quantity} {type}"`. See the example string below for more detail.

1. **Hint:** use `re.findall`
2. **Hint:** use `\d` for digits and either `*` or `+`.

```
BEGIN QUESTION
name: q2d
```

In [74]:
text_q2d = "I've got 10 eggs that I stole from 20 gooses belonging to 30 giants."


# BEGIN SOLUTION
res_q2d = re.findall(r"\d+ \w+", text_q2d)  # raw string; a plain space needs no backslash
# END SOLUTION

res_q2d

['10 eggs', '20 gooses', '30 giants']

In [75]:
# TEST
res_q2d
# EXPECTED: ['10 eggs', '20 gooses', '30 giants']

['10 eggs', '20 gooses', '30 giants']

## Question 2e (optional):

Write a regular expression to replace all words that are not `"mushroom"` with `"badger"`.

In [77]:
text_qe = 'this is a word mushroom mushroom'

# BEGIN SOLUTION
# The \b inside the lookahead makes it reject only the whole word "mushroom"
res_qe = re.sub(r"\b(?!mushroom\b)\w+", r"badger", text_qe)
# END SOLUTION
res_qe
#EXPECTED: 'badger badger badger badger mushroom mushroom'

'badger badger badger badger mushroom mushroom'

For additional references on the above problem, see https://www.youtube.com/watch?v=NL6CDFn2i3I
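A negative lookahead, `(?!...)`, is one way to express "a word that is not this word". The sketch below contrasts two variants on a made-up string; the difference is whether the lookahead ends with `\b`, which decides how longer words that merely start with `mushroom` are treated.

```python
import re

menu = "mushroom mushrooms soup"  # hypothetical sample

# Without \b in the lookahead, any word that starts with "mushroom" is
# protected, so "mushrooms" is left alone as well.
loose = re.sub(r"\b(?!mushroom)\w+", "badger", menu)

# With \b in the lookahead, only the exact word "mushroom" is protected.
strict = re.sub(r"\b(?!mushroom\b)\w+", "badger", menu)

print(loose)   # mushroom mushrooms badger
print(strict)  # mushroom badger badger
```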