# Lab 5: Regular Expressions

**If you are not attending lab, this assignment is due 09/26/2017 at 11:59pm (graded on accuracy).**

**If you are attending lab, you do not need to submit the assignment; you just need to get checked off by your TA.**

In [1]:
import pandas as pd
import numpy as np
import re
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
sns.set()
sns.set_context("talk")
%matplotlib inline

In [2]:
# These lines load the tests.
# !pip install -U okpy

from IPython.display import display, Latex, Markdown
from client.api.notebook import Notebook
ok = Notebook('lab05.ok')

Assignment: Lab 05
OK, version v1.12.10



In [3]:
# ok.auth(force=True)
ok.auth(force=False)

Successfully logged in as sungbin.andy.kang@berkeley.edu


# Objective of Lab 5

In this lab you will practice the basic usage of regular expressions and learn to use Python's `re` module. Some of the material is based on the tutorial at http://opim.wharton.upenn.edu/~sok/idtresources/python/regex.pdf

As you work through this lab you may find the website http://regex101.com helpful.

---
# Question 1:  Matching Characters

We'll start by learning about the simplest possible regular expressions. Since regular expressions are used
to operate on strings, we'll start with the most common task: matching characters.

Most letters and characters will simply match themselves. For example, the regular expression `r'test'` will match the string `test` exactly. There are exceptions to this rule; some characters are special, and don't match themselves.

Here is a list of metacharacters that are widely used in regular expressions.

<table border="1" class="dataframe" >
<thead>
  <tr style="text-align: right;">
    <th>Pattern </th>
    <th>Description</th> 
  </tr>
 </thead>
 <tbody>
  <tr>
    <td>^</td>
    <td>Matches beginning of line.</td> 
  </tr>
  <tr>
    <td>$</td>
    <td>Matches end of line.</td> 
  </tr>
  <tr>
    <td>.</td>
    <td>Matches any single character except newline. </td> 
  </tr>
  <tr>
    <td>*</td>
    <td>Matches 0 or more occurrences of preceding expression.</td>
  </tr>
  <tr>
    <td>+</td>
    <td>Matches 1 or more occurrences of preceding expression.</td>
  </tr>
  <tr>
    <td>?</td>
    <td>Matches 0 or 1 occurrence of preceding expression.</td>
  </tr>
  <tr>
    <td>[...]</td>
    <td>Matches any single character in brackets.</td>
  </tr>
  <tr>
    <td>[^...]</td>
    <td>Matches any single character <b>not</b> in brackets.</td>
  </tr>
  <tr>
    <td>{n}</td>
    <td>Matches exactly n occurrences of preceding expression.</td>
  </tr>
  <tr>
    <td>{n,}</td>
    <td>Matches n or more occurrences of preceding expression.</td>
  </tr>
  <tr>
    <td>{n,m}</td>
    <td>Matches at least n and at most m occurrences of preceding expression.</td>
  </tr>
  <tr>
    <td>a|b</td>
    <td>Matches either a or b.</td>
  </tr>
  <tr>
    <td>\1...\9</td>
    <td>Matches the same text as the n-th grouped subexpression (a backreference).</td>
  </tr>
  </tbody>
</table>
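
To see a few of these metacharacters in action, here is a minimal sketch. The patterns and test strings are made up for illustration, and `re.fullmatch` is used so that each pattern has to account for the whole string.

```python
import re

# Each tuple is (pattern, candidate string); all of them are illustrative.
checks = [
    (r"colou?r", "color"),      # ? makes the "u" optional
    (r"colou?r", "colour"),
    (r"ab*c", "ac"),            # * allows zero "b" characters
    (r"ab+c", "ac"),            # + requires at least one "b"
    (r"[aeiou]{2}", "oo"),      # exactly two characters from the class
    (r"[^0-9]+", "abc"),        # one or more characters that are not digits
    (r"cat|dog", "dog"),        # alternation
    (r"(ab)\1", "abab"),        # \1 repeats whatever group 1 matched
]

results = [bool(re.fullmatch(pattern, text)) for pattern, text in checks]
for (pattern, text), matched in zip(checks, results):
    print(f"{pattern!r:16} {text!r:10} {matched}")
```

Only `ab+c` against `ac` fails, because `+` needs at least one `b`.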


Perhaps the most important metacharacter is the backslash, ‘\’. As in Python string literals, the backslash
can be followed by various characters to signal various special sequences. It’s also used to escape all the
metacharacters so you can still match them in patterns; for example, if you need to match a `[` or `\`, you
can precede them with a backslash to remove their special meaning:  `\[` or `\\`. 
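
As a small sketch of escaping, the example below searches a made-up string for a literal `[` and a literal backslash. It also uses `re.escape`, which builds an escaped pattern from literal text so you do not have to add the backslashes by hand.

```python
import re

text = r"path\to[file]"   # raw string: contains one literal backslash

# \[ matches a literal "[" and \\ matches a literal backslash.
bracket = re.search(r"\[", text)
backslash = re.search(r"\\", text)

print(bracket.start())    # position of the "[" character
print(backslash.start())  # position of the backslash character

# re.escape escapes every metacharacter in the given literal text.
literal = re.escape("[file]")
print(re.findall(literal, text))
```

Writing patterns as raw strings (`r"..."`) keeps Python from interpreting the backslashes before the `re` module sees them.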

The following predefined special sequences are available:

<table border="1" class="dataframe" >
<thead>
  <tr style="text-align: right;">
    <th>Pattern </th>
    <th>Description</th> 
  </tr>
 </thead>
 <tbody>
  <tr>
    <td>\d</td>
    <td>Matches any decimal digit; this is equivalent to the class <code>[0-9]</code>.</td> 
  </tr>
  <tr>
    <td>\D</td>
    <td>Matches any non-digit character; this is equivalent to the class <code>[^0-9]</code>.</td> 
  </tr>
  <tr>
    <td>\s</td>
    <td>Matches any whitespace character; this is equivalent to the class <code>[ \t\n\r\f\v]</code>.</td> 
  </tr>
  <tr>
    <td>\S</td>
    <td>Matches any non-whitespace character; this is equivalent to the class <code>[^ \t\n\r\f\v]</code>.</td>
  </tr>
  <tr>
    <td>\w</td>
    <td>Matches any alphanumeric character or underscore; this is equivalent to the class <code>[a-zA-Z0-9_]</code>.</td>
  </tr>
  <tr>
    <td>\W</td>
    <td>Matches any character that is not alphanumeric or an underscore; this is equivalent to the class <code>[^a-zA-Z0-9_]</code>.</td>
  </tr>
  </tbody>
</table>
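
The short sketch below applies some of these special sequences to a made-up string with `re.findall`, so you can see which characters each class picks up.

```python
import re

sample = "Room 42, floor_3\tOK!"   # illustrative string; \t is a real tab here

digits = re.findall(r"\d", sample)      # individual digits
words = re.findall(r"\w+", sample)      # runs of letters, digits and underscores
spaces = re.findall(r"\s", sample)      # spaces and the tab
non_words = re.findall(r"\W", sample)   # everything \w does not match

print(digits)
print(words)
print(spaces)
print(non_words)
```

Note that `floor_3` comes back as one word because `\w` includes the underscore.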

In this question, try writing a pattern that matches all of the listed rows. *It may be as simple as the common letters on each line.*

---
<br/><br/>
### Question 1 a

Write a single regular expression to match the following strings without using the `|` operator.

1. **Match:** `abcdefg`
1. **Match:** `abcde`
1. **Match:** `abc`


In [4]:
# Task 1 
# Match abcdefg
# Match abcde
# Match abc
regx1 = r"abcd?e?f?g?" # fill in your pattern
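
One way to sanity-check the pattern locally is sketched below. It assumes the pattern should cover the whole string (`re.fullmatch`); the lab's tests may apply it differently. The stricter pattern `strict` is an illustrative alternative, not part of the lab.

```python
import re

regx1 = r"abcd?e?f?g?"        # pattern from the cell above
strict = r"abc(de(fg)?)?"     # assumed alternative: letters are optional in pairs

targets = ["abcdefg", "abcde", "abc"]
full = [bool(re.fullmatch(regx1, s)) for s in targets]
print(full)

# Each optional letter is independent, so strings outside the list also match.
extra_loose = bool(re.fullmatch(regx1, "abcg"))
extra_strict = bool(re.fullmatch(strict, "abcg"))
print(extra_loose, extra_strict)
```

Both patterns accept the three target strings; they only differ on inputs such as `abcg` that the question does not list.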

In [5]:
_ = ok.grade('q01a')
_ = ok.backup()

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed



Saving notebook... Saved 'lab05.ipynb'.
Backup... 100% complete
Backup successful for user: sungbin.andy.kang@berkeley.edu
URL: https://okpy.org/cal/ds100/fa17/lab05/backups/2kmnW1
NOTE: this is only a backup. To submit your assignment, use:
	python3 ok --submit



---
<br/><br/>
### Question 1 b

Write a single regular expression to match all strings containing the sequence `123` for example:

1. **Match:** `abc123xyz`
1. **Match:** `define "123"`
1. **Match:** `var g = 123`
1. **Skip:** `var g = 124`

In [8]:
# Task 2
# Match abc123xyz
# Match define "123"
# Match var g = 123
# Skip var g = 124
regx2 = r".*123.*" # fill in your pattern
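
Whether the surrounding `.*` is needed depends on how the pattern is applied. The sketch below compares `re.search`, which looks for the pattern anywhere in the string, with `re.fullmatch`, which requires the pattern to cover the whole string. How the lab's tests apply the pattern is not shown, so treat this as an illustration only.

```python
import re

lines = ["abc123xyz", 'define "123"', "var g = 123", "var g = 124"]

# re.search scans the whole string, so the bare sequence is enough.
bare = [bool(re.search(r"123", s)) for s in lines]

# re.fullmatch needs the wildcards to absorb the text around "123".
whole = [bool(re.fullmatch(r".*123.*", s)) for s in lines]
bare_whole = [bool(re.fullmatch(r"123", s)) for s in lines]

print(bare)
print(whole)
print(bare_whole)
```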

In [9]:
_ = ok.grade('q01b')
_ = ok.backup()

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed



Saving notebook... Saved 'lab05.ipynb'.
Backup... 100% complete
Backup successful for user: sungbin.andy.kang@berkeley.edu
URL: https://okpy.org/cal/ds100/fa17/lab05/backups/qx0PA2
NOTE: this is only a backup. To submit your assignment, use:
	python3 ok --submit



---
<br/><br/>
### Question 1 c

Write a single regular expression to match the following strings without using the `|` operator.

1. **Match:** `can`
1. **Match:** `man`
1. **Match:** `fan`
1. **Skip:** `dan`
1. **Skip:** `ran`
1. **Skip:** `pan`

In [10]:
# Task 3
# Match can
# Match man
# Match fan
# Skip dan
# Skip ran
# Skip pan
regx3 = r"[cmf]an" # fill in your pattern
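
The character class can be checked against the six words with a short sketch like the one below. The negated class `[^drp]an` is shown only for comparison: it also passes on these six words, but it accepts any other first character too.

```python
import re

regx3 = r"[cmf]an"            # pattern from the cell above
negated = r"[^drp]an"         # comparison pattern, not part of the lab

words = ["can", "man", "fan", "dan", "ran", "pan"]

matched = [w for w in words if re.fullmatch(regx3, w)]
matched_negated = [w for w in words if re.fullmatch(negated, w)]
print(matched)
print(matched_negated)

# "tan" is a made-up extra word: only the negated class lets it through.
print(bool(re.fullmatch(regx3, "tan")), bool(re.fullmatch(negated, "tan")))
```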

In [11]:
_ = ok.grade('q01c')
_ = ok.backup()

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed



Saving notebook... Saved 'lab05.ipynb'.
Backup... 100% complete
Backup successful for user: sungbin.andy.kang@berkeley.edu
URL: https://okpy.org/cal/ds100/fa17/lab05/backups/qx0PM2
NOTE: this is only a backup. To submit your assignment, use:
	python3 ok --submit



---
<br/><br/>
### Question 1 d

Write a single regular expression to match the following strings without using the `|` operator.

1. **Match:** `1 file found.`
1. **Match:** `2 files found.`
1. **Match:** `3 files found.`
1. **Skip:** `No files found.`

In [14]:
# Match 1 file found.
# Match 2 files found.
# Match 24 files found.
# Skip No files found.
regx4 = r"\d.*" # fill in your pattern
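
The pattern above accepts any string that contains a digit. If you want the pattern to describe the whole message, a stricter version is sketched below. The name `strict` and the extra line `Error 404` are made up for illustration, and `re.search` is an assumption about how the pattern is applied.

```python
import re

loose = r"\d.*"                   # pattern from the cell above
strict = r"\d+ files? found\."    # assumed stricter alternative, not from the lab

lines = [
    "1 file found.",
    "2 files found.",
    "3 files found.",
    "No files found.",
    "Error 404",                  # made-up line that should be skipped
]

loose_hits = [s for s in lines if re.search(loose, s)]
strict_hits = [s for s in lines if re.search(strict, s)]
print(loose_hits)
print(strict_hits)
```

Here `s?` makes the plural optional and `\.` matches the final period literally.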

In [15]:
_ = ok.grade('q01d')
_ = ok.backup()

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed



Saving notebook... Saved 'lab05.ipynb'.
Backup... 100% complete
Backup successful for user: sungbin.andy.kang@berkeley.edu
URL: https://okpy.org/cal/ds100/fa17/lab05/backups/pYvO72
NOTE: this is only a backup. To submit your assignment, use:
	python3 ok --submit



---
# Question 2:  Text Manipulation

Now that we have covered the basics of the `re` module, we are ready to move beyond matching.

---
<br></br>
## Question 2a:

Write a Python program to extract and print the numbers in a given string.

In [18]:
text_q2a = "Ten 10, Twenty 20, Thirty 30"
res_q2a = re.findall(r'\d+', text_q2a) # Hint: use re.findall()
res_q2a

['10', '20', '30']
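
The `+` in `\d+` is what keeps each number together. The sketch below contrasts it with a bare `\d`, and converts the matches to integers as one possible next step (the conversion is not required by the question).

```python
import re

text = "Ten 10, Twenty 20, Thirty 30"

single = re.findall(r"\d", text)    # one match per digit
runs = re.findall(r"\d+", text)     # one match per run of digits

# re.findall returns strings; convert them if you need numeric values.
numbers = [int(n) for n in runs]

print(single)
print(runs)
print(sum(numbers))
```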

In [19]:
_ = ok.grade('q02a')
_ = ok.backup()

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed



Saving notebook... Saved 'lab05.ipynb'.
Backup... 100% complete
Backup successful for user: sungbin.andy.kang@berkeley.edu
URL: https://okpy.org/cal/ds100/fa17/lab05/backups/rkxQ5k
NOTE: this is only a backup. To submit your assignment, use:
	python3 ok --submit



---
<br></br>
## Question 2b:

Write a Python program to replace at most 2 occurrences of space, comma, or dot with a colon.

In [37]:
text_q2b = 'Python Exercises, PHP exercises.'
res_q2b = re.sub(pattern=r'\W', repl=r':', string=text_q2b, count=2) # Hint: use re.sub()
res_q2b

'Python:Exercises: PHP exercises.'
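
`\W` matches every non-word character, which is broader than the three characters named in the question. A narrower sketch using the explicit class `[ ,.]` is shown below; the string `a-b c` is a made-up example where the two patterns differ.

```python
import re

text = 'Python Exercises, PHP exercises.'

# count=2 stops after the first two replacements.
limited = re.sub(r'[ ,.]', ':', text, count=2)
unlimited = re.sub(r'[ ,.]', ':', text)
print(limited)
print(unlimited)

# The hyphen is a non-word character, so only \W replaces it.
broad = re.sub(r'\W', ':', 'a-b c')
narrow = re.sub(r'[ ,.]', ':', 'a-b c')
print(broad, narrow)
```

Inside a character class the dot is literal, so it does not need a backslash.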

In [38]:
_ = ok.grade('q02b')
_ = ok.backup()

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed



Saving notebook... Saved 'lab05.ipynb'.
Backup... 100% complete
Backup successful for user: sungbin.andy.kang@berkeley.edu
URL: https://okpy.org/cal/ds100/fa17/lab05/backups/yPG1jE
NOTE: this is only a backup. To submit your assignment, use:
	python3 ok --submit



---
<br></br>
## Question 2c: 

Write a Python program to extract the values between quotation marks in a string.

In [76]:
text_q2c = '"Python", "PHP", "Java"'
res_q2c = re.findall(r'\w.*?\b', text_q2c) # Hint: use re.findall()
res_q2c

['Python', 'PHP', 'Java']
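
The pattern above finds words rather than quoted values, which gives the same result for this input. A sketch that relies on the quotation marks themselves is shown below: when the pattern has one capturing group, `re.findall` returns only the text captured by that group. The string `tricky` is made up to show where the two approaches differ.

```python
import re

text_q2c = '"Python", "PHP", "Java"'

# The parentheses capture what sits between each pair of quotes.
by_quotes = re.findall(r'"(.*?)"', text_q2c)
print(by_quotes)

tricky = '"C++", "Objective-C", unquoted'
quoted_only = re.findall(r'"(.*?)"', tricky)
word_based = re.findall(r'\w.*?\b', tricky)
print(quoted_only)
print(word_based)
```

The lazy quantifier `.*?` stops at the first closing quote instead of running on to the last one.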

In [77]:
_ = ok.grade('q02c')
_ = ok.backup()

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed



Saving notebook... Saved 'lab05.ipynb'.
Backup... 100% complete
Backup successful for user: sungbin.andy.kang@berkeley.edu
URL: https://okpy.org/cal/ds100/fa17/lab05/backups/n5qnXR
NOTE: this is only a backup. To submit your assignment, use:
	python3 ok --submit



---
<br></br>
## Question 2d:

Write a Python program to convert a snake case string (e.g. `ds_lab`) to a camel case string (e.g. `DsLab`).

In [80]:
text_q2d = 'python_exercises' 
# Hint: use re.split()
parts = re.split(r'_', text_q2d)
# Capitalize every piece so the code also works for names with more than one underscore.
res_q2d = ''.join(part.title() for part in parts)
res_q2d

'PythonExercises'
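
As an alternative to splitting, the conversion can be sketched with `re.sub` and a replacement function. The helper name `snake_to_camel` is hypothetical, and the sketch assumes lowercase ASCII names.

```python
import re

def snake_to_camel(name):
    """Uppercase the first letter and every letter that follows an underscore.

    Hypothetical helper for illustration; assumes lowercase ASCII input.
    """
    # (?:^|_) matches the start of the string or an underscore without capturing it;
    # group 1 is the letter that should be uppercased.
    return re.sub(r'(?:^|_)([a-z])', lambda m: m.group(1).upper(), name)

print(snake_to_camel('python_exercises'))
print(snake_to_camel('ds_lab'))
print(snake_to_camel('load_csv_file'))
```

Because the underscore is part of the match but not part of the replacement, it disappears from the result.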

In [81]:
_ = ok.grade('q02d')
_ = ok.backup()

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed



Saving notebook... Saved 'lab05.ipynb'.
Backup... 100% complete
Backup successful for user: sungbin.andy.kang@berkeley.edu
URL: https://okpy.org/cal/ds100/fa17/lab05/backups/jRm7Nz
NOTE: this is only a backup. To submit your assignment, use:
	python3 ok --submit



---
<br></br>
## Question 2e:

Use pandas to remove the parentheses, and everything inside them, from each string.

In [99]:

df_q2e = pd.DataFrame(pd.Series(["example (.com)", "w3resource", "github (.com)", "stackoverflow (.com)"], name="q2e"))
# Hint: use df.str.replace()
# Escape the parentheses so they match literally. regex=True is needed on pandas 2.0+,
# where str.replace treats the pattern as a literal string by default.
df_q2e['q2e'] = df_q2e["q2e"].str.replace(r'\(.*?\)', "", regex=True).str.replace(" ", "")
res_q2e = df_q2e['q2e']
res_q2e[0]

'example'

In [100]:
_ = ok.grade('q02e')
_ = ok.backup()

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed



Saving notebook... Saved 'lab05.ipynb'.
Backup... 100% complete
Backup successful for user: sungbin.andy.kang@berkeley.edu
URL: https://okpy.org/cal/ds100/fa17/lab05/backups/ADGjZj
NOTE: this is only a backup. To submit your assignment, use:
	python3 ok --submit



## Submission

Congrats! You are finished with this assignment. For convenience, we've included a cell below that runs all the OkPy tests.

In [101]:
import os
print("Running all tests...")
_ = [ok.grade(q[:-3]) for q in os.listdir("ok_tests") if q.startswith('q')]

Running all tests...
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 1
    Failed: 0
[ooooooooook] 100.0% passed


Now, run the cell below to submit your assignment to OkPy. The autograder should email you shortly with your autograded score. The autograder will only run once every 30 minutes.

**If you're failing tests on the autograder but pass them locally**, you should simulate the autograder by doing the following:

1. In the top menu, click Kernel -> Restart and Run all.
2. Run the cell above to run each OkPy test.

**You must make sure that you pass all the tests when running steps 1 and 2 in order.** If you are still failing autograder tests, you should double check your results.

In [None]:
_ = ok.submit()
