In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab09.ipynb")

# Lab 9: Text Wrangling and Regular Expressions

In this lab you will get some practice using regular expressions. 

In [None]:
import pandas as pd
import numpy as np
import re
import seaborn as sns
import matplotlib.pyplot as plt
import os

plt.style.use('fivethirtyeight') # Use plt.style.available to see more styles
sns.set()
sns.set_context("talk")
plt.rcParams['figure.figsize'] = (8, 5)
%matplotlib inline

<br/><br/>
<hr style="border: 5px solid #8a8c8c;" />
<hr style="border: 1px solid #ffcd00;" />


## Part 1: Practice with Regular Expressions 

Regular expressions can be tricky, and the best way to gain familiarity with them is through lots of practice. In this question, you will work through ten exercises, each of which requires you to write a regular expression that matches strings that satisfy certain criteria. Make sure to take a close look at the doctests for each function in `lab.py`, as they provide useful guidance for the types of strings you should and shouldn't match.
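Since the doctests are the main guide for each exercise, it can help to run them directly while iterating on a pattern. The sketch below is only an illustration: `match_demo` and its pattern are hypothetical and are not one of the lab questions. It uses the standard library's `doctest.run_docstring_examples`, which prints nothing when every example in the docstring passes and prints a failure report otherwise.

```python
import doctest
import re


def match_demo(string):
    """
    Hypothetical example, not one of the lab questions.
    >>> match_demo("cat")
    True
    >>> match_demo("concat")
    False
    """
    # Assumed pattern for illustration: the string must start with "cat"
    pattern = r"^cat"
    prog = re.compile(pattern)
    return prog.search(string) is not None


# Run only this function's doctests; silent if they all pass
doctest.run_docstring_examples(
    match_demo, {"match_demo": match_demo}, name="match_demo"
)
```

The same call should work for any of the `match_*` functions in this notebook once a pattern has been filled in.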

***Notes:*** 
- Make sure to refer to the [Regular Expression Resources](https://dsc80.com/resources/#regular-expressions) posted on the course website. In particular, we recommend having [regex101.com](https://regex101.com/) open while working, along with the [cheat sheet](https://dsc80.com/resources/other/berkeley-regex-reference.pdf).

- Each exercise has a star rating, between 1 and 3 stars, indicating its difficulty level (1 being the easiest, 3 being the hardest). If you are spending lots of time on 1-star exercises, take a close look at the syntax from lecture, as there is probably an easier way of writing the necessary pattern!

- Each function below tests your pattern with `re.search`. Per its [definition](https://docs.python.org/3/library/re.html#re.search), `re.search` scans through the string looking for the first location where the regular expression pattern produces a match, and returns a corresponding Match object. Keep this behavior in mind if the pattern you are looking for must be at the start of the string.
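To illustrate the `re.search` behavior described in the last note, here is a minimal sketch. The strings and patterns are made up for illustration and are unrelated to the exercises. It shows that an unanchored pattern can match anywhere in the string, while the `^` anchor restricts the match to the start.

```python
import re

text = "xyzabc"

# Without an anchor, re.search finds "abc" even though it is not at the start
unanchored = re.search(r"abc", text)

# With ^, the pattern must match at the very beginning of the string
anchored = re.search(r"^abc", text)

print(unanchored is not None)  # True
print(anchored is not None)    # False
```

The `$` anchor plays the same role for the end of the string.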

<br> 

### Question 1: (1 star) 

Write a regular expression that matches strings that have `'['` as the third character and `']'` as the sixth character. 

In [None]:
def match_1(string):
    """
    DO NOT EDIT THE DOCSTRING!
    >>> match_1("abcde]")
    False
    >>> match_1("ab[cde")
    False
    >>> match_1("a[cd]")
    False
    >>> match_1("ab[cd]")
    True
    >>> match_1("1ab[cd]")
    False
    >>> match_1("ab[cd]ef")
    True
    >>> match_1("1b[#d] _")
    True
    """
    pattern = ...

    # Do not edit following code
    prog = re.compile(pattern)
    return prog.search(string) is not None

In [None]:
grader.check("q11")

### Question 2: (1 star)

Write a regular expression that matches strings that are phone numbers that start with `'(906)'` and follow the format `'(xxx) xxx-xxxx'` (`'x'` represents a digit).

***Note:*** There is a space between `'(xxx)'` and `'xxx-xxxx'`.

In [None]:
def match_2(string):
    """
    DO NOT EDIT THE DOCSTRING!
    >>> match_2("(123) 456-7890")
    False
    >>> match_2("906-456-7890")
    False
    >>> match_2("(906)45-7890")
    False
    >>> match_2("(906) 456-7890")
    True
    >>> match_2("(906)456-789")
    False
    >>> match_2("(906)456-7890")
    False
    >>> match_2("a(906) 456-7890")
    False
    >>> match_2("(906) 456-7890b")
    False
    """
    pattern = ...

    # Do not edit following code
    prog = re.compile(pattern)
    return prog.search(string) is not None

In [None]:
grader.check("q12")

### Question 3: (1 star) 

Write a regular expression that matches strings that:
- are between 6 and 10 characters long (inclusive),
- contain only alphanumeric characters, whitespace and `'?'`, and
- end with `'?'`.


In [None]:
def match_3(string):
    """
    DO NOT EDIT THE DOCSTRING!
    >>> match_3("qwertsd?")
    True
    >>> match_3("qw?ertsd?")
    True
    >>> match_3("ab c?")
    False
    >>> match_3("ab   c ?")
    True
    >>> match_3(" asdfqwes ?")
    False
    >>> match_3(" adfqwes ?")
    True
    >>> match_3(" adf!qes ?")
    False
    >>> match_3(" adf!qe? ")
    False
    """
    pattern = ...

    # Do not edit following code
    prog = re.compile(pattern)
    return prog.search(string) is not None

In [None]:
grader.check("q13")

### Question 4: (2 stars)

Write a regular expression that matches strings with exactly two `'$'`, one of which is at the start of the string, such that:
- the characters between the two `'$'` can be anything (including nothing) except the lowercase letters `'a'`, `'b'`, and `'c'` (and `'$'`), and
- the characters after the second `'$'` can only be the **lowercase or uppercase** letters `'a'`/`'A'`, `'b'`/`'B'`, and `'c'`/`'C'`, with every `'a'`/`'A'` before every `'b'`/`'B'`, and every `'b'`/`'B'` before every `'c'`/`'C'`. There must be at least one `'a'` or `'A'`, at least one `'b'` or `'B'`, and at least one `'c'` or `'C'`.

In [None]:
def match_4(string):
    """
    DO NOT EDIT THE DOCSTRING!
    >>> match_4("$$AaaaaBbbbc")
    True
    >>> match_4("$!@#$aABc")
    True
    >>> match_4("$a$aABc")
    False
    >>> match_4("$iiuABc")
    False
    >>> match_4("123$$$Abc")
    False
    >>> match_4("$$Abc")
    True
    >>> match_4("$qw345t$AAAc")
    False
    >>> match_4("$s$Bca")
    False
    >>> match_4("$!@$")
    False
    """
    pattern = ...

    # Do not edit following code
    prog = re.compile(pattern)
    return prog.search(string) is not None

In [None]:
grader.check("q14")

### Question 5: (1 star)

Write a regular expression that matches strings that represent valid Python file names, including the extension. 

***Note:*** For simplicity, assume that file names only contain letters, numbers, and underscores (`'_'`).

***Note:*** Assume a name must start with a letter. 

The Python style guide, PEP 8, says: 

    Package and Module Names

    Modules should have short, all-lowercase names. Underscores can be used in the module name if it improves readability. Python packages should also have short, all-lowercase names, although the use of underscores is discouraged.

In [None]:
def match_5(string):
    """
    DO NOT EDIT THE DOCSTRING!
    >>> match_5("data2201.py")
    True
    >>> match_5("data2201py")
    False
    >>> match_5("data2201..py")
    False
    >>> match_5("data2201+.py")
    False
    """
    pattern = ...

    # Do not edit following code
    prog = re.compile(pattern)
    return prog.search(string) is not None

In [None]:
grader.check("q15")

### Question 6: (1 star) 
Write a regular expression that matches strings that:
- are made up of only lowercase letters and exactly one underscore (`'_'`), and
- have at least one lowercase letter on both sides of the underscore.

In [None]:
def match_6(string):
    """
    DO NOT EDIT THE DOCSTRING!
    >>> match_6("aab_cbb_bc")
    False
    >>> match_6("aab_cbbbc")
    True
    >>> match_6("aab_Abbbc")
    False
    >>> match_6("abcdef")
    False
    >>> match_6("ABCDEF_ABCD")
    False
    """
    pattern = ...

    # Do not edit following code
    prog = re.compile(pattern)
    return prog.search(string) is not None

In [None]:
grader.check("q16")

### Question 7: (1 star) 

Write a regular expression that matches strings that start with and end with an underscore (`'_'`).

In [None]:
def match_7(string):
    """
    DO NOT EDIT THE DOCSTRING!
    >>> match_7("_abc_")
    True
    >>> match_7("abd")
    False
    >>> match_7("bcd")
    False
    >>> match_7("_ncde")
    False
    """
    pattern = ...

    # Do not edit following code
    prog = re.compile(pattern)
    return prog.search(string) is not None

In [None]:
grader.check("q17")

### Question 8: (1 star) 

Apple serial numbers are strings of length 1 or more that are made up of any characters, other than
- the uppercase letter `'O'`, 
- the lowercase letter `'i'`, and 
- the number `'1'`.

Write a regular expression that matches strings that are valid Apple serial numbers.

In [None]:
def match_8(string):
    """
    DO NOT EDIT THE DOCSTRING!
    >>> match_8("ASJDKLFK10ASDO")
    False
    >>> match_8("ASJDKLFK0ASDo!!!!!!! !!!!!!!!!")
    True
    >>> match_8("JKLSDNM01IDKSL")
    False
    >>> match_8("ASDKJLdsi0SKLl")
    False
    >>> match_8("ASDJKL9380JKAL")
    True
    """
    pattern = ...

    # Do not edit following code
    prog = re.compile(pattern)
    return prog.search(string) is not None

In [None]:
grader.check("q18")

### Question 9: (2 stars)

ID numbers are formatted as `'SC-NN-CCC-NNNN'`, where 
- SC represents state code in uppercase (e.g. `'CA'`),
- NN represents a number with 2 digits (e.g. `'98'`),
- CCC represents a three letter city code in uppercase (e.g. `'HOU'`), and
- NNNN represents a number with 4 digits (e.g. `'1024'`).

Write a regular expression that matches strings that are ID numbers corresponding to the cities of `'CHI'` or `'HOU'`, or the state of `'WI'`. Assume that there is only one city named `'CHI'` and only one city named `'HOU'`.

In [None]:
def match_9(string):
    '''
    DO NOT EDIT THE DOCSTRING!
    >>> match_9('WI-32-MAD-1232')
    True
    >>> match_9('wi-23-EUC-1231')
    False
    >>> match_9('MA-36-BOS-5465')
    False
    >>> match_9('CA-56-LAX-7895')
    False
    >>> match_9('WI-32-HOU-0000') # If the state is WI, the city can be any 3 letter code, including HOU or CHI!
    True
    >>> match_9('IL-32-CHI-4491')
    True
    '''
    pattern = ...

    # Do not edit following code
    prog = re.compile(pattern)
    return prog.search(string) is not None

In [None]:
grader.check("q19")

### Question 10: (2 stars)

Assign `True` or `False` to the variable `q10` to answer the following question.

Will the following code match only the first email (up to the @ sign) in the string?

```python 
stri = 'From: Olivia.Rodrigo@yahoo.com, badbunny@hotmail.com, taylorswift@gmail.com'
stri = stri.rstrip()
print(re.findall('From:.+@', stri))
```

In [None]:
q10 = ...

In [None]:
grader.check("q110")

<!-- BEGIN QUESTION -->

#### Question 1.10a

Briefly explain (in fewer than 12 words) your answer above.

**A** *Enter your answer here.*

<!-- END QUESTION -->

### Question 11 - Pattern Match 

Create a regular expression pattern that matches all the positive examples below, but none of the negative examples. You cannot simply list the positive strings "or"ed together.

| Positive | Negative | 
|----------|----------|
| pit      | pt       | 
| spot     | Pot      |
| spate    | peat     | 
| slap two | part     | 
| respite  | SLIP ten |

In [None]:
cases = ['pit', 'spot', 'spate', 'slap two', 'respite', 'pt', 'Pot', 'peat', 
         'part', 'SLIP ten']
positive, negative = [], []
pat = r'...'      # Write regular expression pattern here 

# DO NOT CHANGE BELOW
print('Positive Cases: \n')
for ex in cases: 
    match = re.search(pat, ex)
    if ex=="pt": 
        print("\nNegative Cases: \n")
    if match: 
        print("%9s: found" % ex)
        positive.append(ex)
    else: 
        print("%9s: not found" % ex)
        negative.append(ex)

In [None]:
grader.check("q111")

<br><br>

<hr style="border: 5px solid #8a8c8c;" />
<hr style="border: 1px solid #ffcd00;" />

## Congratulations! You have finished Lab 09!


Congrats! You are finished with this assignment.

Below, you will see a cell. Running this cell will automatically generate a zip file with your autograded answers. 

**You are responsible for ensuring your submission follows our requirements. We will not grant regrade requests or extensions for submissions that don't follow instructions.** If you encounter any difficulties with submission, please don't hesitate to reach out to staff prior to the deadline.

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(run_tests=True)