# Part 0 of 2: Simple string processing review

**Exercise ordering:** Each exercise builds logically on previous exercises, but you may solve them in any order. That is, if you can't solve an exercise, you can still move on and try the next one. Use this to your advantage, as the exercises are **not** necessarily ordered in terms of difficulty. Higher point values generally indicate more difficult exercises. 

**Demo cells:** Code cells starting with the comment `### define demo inputs` load results from prior exercises applied to the entire data set and use those to build demo inputs. These must be run for subsequent demos to work properly, but they do not affect the test cells. The data loaded in these cells may be rather large (at least in terms of human readability). You are free to print or otherwise use Python to explore them, but we did not print them in the starter code.

**Debugging your code:** Right before each exercise test cell, there is a block of text explaining the variables available to you for debugging. You may use these to test your code and can print/display them as needed. Be careful when printing large objects; you may want to print only the head or a few rows at a time.

**Exercise point breakdown:**

- Exercise 0: **1** point
- Exercise 1: **2** points
- Exercise 2: **2** points
- Exercise 3: **3** points

**Final reminders:** 

- Submit after **every exercise**
- Review the generated grade report after you submit to see what errors were returned
- Stay calm, skip problems as needed, and take short breaks at your leisure


In [1]:
### Global Imports

# Use this cell to import anything common, e.g. numpy, pandas, sqlite3
import string

# Use this cell to bring in the starter data

In [2]:
text = "sgtEEEr2020.0"

In [3]:
# Strings have methods for checking "global" string properties
print("1.", text.isalpha())

# These can also be applied per character
print("2.", [c.isalpha() for c in text])

1. False
2. [True, True, True, True, True, True, True, False, False, False, False, False, False]


In [4]:
# Here are a bunch of additional useful methods
print("BELOW: (global) -> (per character)")
print(text.isdigit(), "-->", [c.isdigit() for c in text])
print(text.isspace(), "-->", [c.isspace() for c in text])
print(text.islower(), "-->", [c.islower() for c in text])
print(text.isupper(), "-->", [c.isupper() for c in text])
print(text.isnumeric(), "-->", [c.isnumeric() for c in text])

BELOW: (global) -> (per character)
False --> [False, False, False, False, False, False, False, True, True, True, True, False, True]
False --> [False, False, False, False, False, False, False, False, False, False, False, False, False]
False --> [True, True, True, False, False, False, True, False, False, False, False, False, False]
False --> [False, False, False, True, True, True, False, False, False, False, False, False, False]
False --> [False, False, False, False, False, False, False, True, True, True, True, False, True]
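
The `isdigit()` and `isnumeric()` rows above are identical because `text` contains only ASCII characters. The two methods can disagree on other Unicode characters. The following is a small sketch using sample characters chosen purely for illustration (they are not part of the exercise data):

```python
# Sample characters chosen for illustration only.
samples = ["7", "\u00b2", "\u00bd"]  # ASCII digit, superscript two, fraction one-half

# For each character, record (isdecimal, isdigit, isnumeric).
digit_flags = {ch: (ch.isdecimal(), ch.isdigit(), ch.isnumeric()) for ch in samples}

for ch, flags in digit_flags.items():
    print(repr(ch), flags)
```

If a check must accept only the ASCII digits `0`-`9`, combining a digit test with `str.isascii()` is one way to rule out the other cases.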


---

**Exercise 0** (1 point). Create a new function that checks whether a given input string is a properly formatted social security number, i.e., has the pattern, `XXX-XX-XXXX`, _including_ the separator dashes, where each `X` is a digit. It should return `True` if so or `False` otherwise.

In [5]:
### Define demo inputs
demo_str_ex0_0 = '832-38-1847'
demo_str_ex0_1 = '832 -38 -  1847'
demo_str_ex0_2 = '832-bc-3847'
demo_str_ex0_3 = '832381847'

<!-- Expected demo output text block -->
The demos included in the solution cell below should display the following output:
```
is_ssn('832-38-1847') -> True
is_ssn('832 -38 -  1847') -> False
is_ssn('832-bc-3847') -> False
is_ssn('832381847') -> False
```
<!-- Include any shout outs here -->

In [6]:
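def is_ssn(s):
    # A valid SSN has exactly 11 characters: 9 digits plus 2 dashes
    if len(s) != 11:
        return False

    # The separators must be dashes at positions 3 and 6
    if s[3] != '-' or s[6] != '-':
        return False

    # Every remaining character must be an ASCII digit
    digits = s[:3] + s[4:6] + s[7:]
    return digits.isascii() and digits.isdigit()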

In [7]:
import re

def is_valid_ssn(ssn):
    # Define the SSN pattern: three digits, two digits, four digits, separated by dashes
    pattern = r'\d{3}-\d{2}-\d{4}'
    
    # Compile the regular expression pattern
    regex = re.compile(pattern)
    
    # fullmatch() succeeds only if the entire string matches the pattern,
    # so inputs with extra leading or trailing characters are rejected
    return regex.fullmatch(ssn) is not None
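
One subtlety worth checking with any regex-based validator: `match()` anchors the pattern only at the start of the string, so trailing characters can slip through unless the pattern ends with an anchor or `fullmatch()` is used. A minimal sketch of the difference, using a made-up input that is not one of the demo strings:

```python
import re

ssn_pattern = re.compile(r"\d{3}-\d{2}-\d{4}")

# Hypothetical input: a valid SSN followed by extra text.
candidate = "832-38-1847 and trailing junk"

prefix_only = ssn_pattern.match(candidate) is not None       # anchored at the start only
whole_string = ssn_pattern.fullmatch(candidate) is not None  # must consume the entire string

print(prefix_only, whole_string)
```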

In [8]:
demo_str_ex0_0 = '832-38-1847'

In [9]:
### demo function call
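print(f"is_ssn({demo_str_ex0_0!r}) -> {is_ssn(demo_str_ex0_0)}")
print(f"is_ssn({demo_str_ex0_1!r}) -> {is_ssn(demo_str_ex0_1)}")
print(f"is_ssn({demo_str_ex0_2!r}) -> {is_ssn(demo_str_ex0_2)}")
print(f"is_ssn({demo_str_ex0_3!r}) -> {is_ssn(demo_str_ex0_3)}")

is_ssn('832-38-1847') -> True
is_ssn('832 -38 -  1847') -> False
is_ssn('832-bc-3847') -> False
is_ssn('832381847') -> False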


<!-- Test Cell Boilerplate -->
The cell below will test your solution for Exercise 0. The testing variables will be available for debugging under the following names in a dictionary format.
- `input_vars` - Input variables for your solution. 
- `original_input_vars` - Copy of input variables from prior to running your solution. These _should_ be the same as `input_vars` - otherwise the inputs were modified by your solution.
- `returned_output_vars` - Outputs returned by your solution.
- `true_output_vars` - The expected output. This _should_ "match" `returned_output_vars` based on the question requirements - otherwise, your solution is not returning the correct output. 

In [10]:
### test_cell_ex0

from tester_fw.testers import Tester

conf = {
    'case_file':'tc_0', 
    'func': is_ssn, # replace this with the function defined above
    'inputs':{ # input config dict. keys are parameter names
        's':{
            'dtype':'str', # data type of param.
            'check_modified':False,
        }
    },
    'outputs':{
        'output_0':{
            'index':0,
            'dtype':'bool',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': True, # Ignored if dtype is not df
            'check_row_order': True, # Ignored if dtype is not df
            'check_column_type': True, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'F24oYNyh8kq_wkZD_Oo0ZCPHLcoO-xNXHOYNiPnQmmY=', path='resource/asnlib/publicdata/')
for _ in range(70):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise


print('Passed! Please submit.')

Passed! Please submit.


---

# Regular expressions

Exercise 0 hints at the general problem of finding patterns in text. A handy tool for this problem is Python's Regular Expression module, `re`.

A _regular expression_ is a specially formatted pattern, written as a string. Matching patterns with regular expressions has 3 steps:

1. You come up with a pattern to find.
2. You compile it into a _pattern object_.
3. You apply the pattern object to a string to find _matches_, i.e., instances of the pattern within the string.

As you read through the examples below, refer also to the [regular expression HOWTO document](https://docs.python.org/3/howto/regex.html) for many more examples and details.

In [11]:
import re

## Basics

Let's see how this scheme works for the simplest case, in which the pattern is an *exact substring*. In the following example, suppose we want to look for the substring `'fox'` within a larger input string.

In [12]:
pattern = 'fox'
pattern_matcher = re.compile(pattern)

input_string = 'The quick brown fox jumps over the lazy dog'
matches = pattern_matcher.search(input_string)
print(matches)

<re.Match object; span=(16, 19), match='fox'>


Observe that the returned object, `matches`, is a special object. Inspecting the printed output, notice that the matching text, `'fox'`, was found and located at positions 16-18 of the `input_string`. Had there been no matches, then `.search()` would have returned `None`, as in this example:

In [13]:
print(pattern_matcher.search("This input has a FOX, but it's all uppercase and so won't match."))

None


You can also write code to query the `matches` object for more information.

In [14]:
print(matches.group())
print(matches.start())
print(matches.end())
print(matches.span())

fox
16
19
(16, 19)


---

**Module-level searching.** For infrequently used patterns, you can also skip creating the pattern object and just call the module-level search function, `re.search()`.

In [15]:
matches_2 = re.search('jump', input_string)
assert matches_2 is not None
print("Found", matches_2.group(), "@", matches_2.span())

Found jump @ (20, 24)


**Other Search Methods.** Besides `search()`, there are several other pattern-matching procedures:

1. `match()`    - Determine whether the regular expression (RE) matches at the beginning of the string.
2. `search()`   - Scan through a string, looking for the first location where this RE matches.
3. `findall()`  - Find all non-overlapping substrings where the RE matches, and return them as a list.
4. `finditer()` - Find all non-overlapping substrings where the RE matches, and return them as an iterator of match objects.

We'll use several of these below; again, refer to the [HOWTO](https://docs.python.org/3/howto/regex.html) for more details.
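
As a quick side-by-side comparison of the four procedures listed above, here is a minimal sketch. The sample sentence and variable names are illustrative choices only.

```python
import re

sentence = "jump on the trampoline and then jump on the castle"
pat = re.compile(r"jump")

at_start = pat.match(sentence)                 # match object: the string begins with 'jump'
not_at_start = pat.match("then " + sentence)   # None: 'jump' is not at position 0
anywhere = pat.search("then " + sentence)      # first occurrence, wherever it is
found = pat.findall(sentence)                  # list of matched substrings
spans = [m.span() for m in pat.finditer(sentence)]  # one match object per occurrence

print(at_start, not_at_start, anywhere, found, spans, sep="\n")
```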

In [16]:
matches_2 = re.findall('jump', input_string)
matches_2

['jump']

In [17]:
shef = "jump on the trampoline and then jump on the castle"
matches_3 = re.findall('jump', shef)
matches_3

['jump', 'jump']

In [18]:
matches_2 = re.finditer('jump', input_string)
matches_2

<callable_iterator at 0x7f91b1286610>

In [19]:
text = "The quick brown fox jumps over the lazy dog."
pattern = r'\b\w{3}\b'  # Match three-letter words

matches = re.finditer(pattern, text)

# re.finditer() returns an iterator; loop through it to access the match objects.
# Each match object contains information about the matched substring,
# including its start and end positions in the input string.

for match in matches:
    print("Match:", match.group(), "at position:", match.start())

Match: The at position: 0
Match: fox at position: 16
Match: the at position: 31
Match: dog at position: 40


The regular expression `r'\b\w{3}\b'` is a pattern used to match three-letter words in a text. Here's a breakdown of its components:

1. `\b`: This is a word boundary anchor. It matches the position where a word starts or ends. It ensures that the matched word is not part of a longer word. For example, it would match "dog" in "The dog barked," but not in "The dogs barked."

2. `\w`: This is a shorthand character class that matches any word character, which includes letters (both uppercase and lowercase), digits, and underscores.

3. `{3}`: This is a quantifier that specifies the exact number of times the preceding pattern (`\w` in this case) should be matched. `{3}` means that it should match exactly three word characters.

4. `\b`: Another word boundary anchor, used at the end of the pattern to ensure that the matched word is not part of a longer word.

So, when you put it all together, `r'\b\w{3}\b'` matches any three-letter word in a text, surrounded by word boundaries. For example, in the sentence "The cat is cute," this regular expression would match the word "cat" because it consists of three letters and is surrounded by word boundaries.
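
To check the word-boundary behavior described above, here is a small sketch. The third input is a made-up string showing that `\w` also accepts digits and underscores, so the pattern really matches three-character "words" rather than strictly three letters.

```python
import re

three_chars = re.compile(r"\b\w{3}\b")

alone = three_chars.findall("The dog barked")      # 'dog' stands on its own
plural = three_chars.findall("The dogs barked")    # 'dogs' has four characters
mixed = three_chars.findall("id_ 42x ab3 abcd")    # digits and underscores count as word characters

print(alone, plural, mixed, sep="\n")
```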

---

## A pattern language

An exact substring is one kind of pattern, but the power of regular expressions is that they provide an entire "_mini-language_" for specifying more general patterns.

To start, read the section of the HOWTO on ["Simple Patterns"](https://docs.python.org/3/howto/regex.html#simple-patterns). We highlight a few constructs below.

In [20]:
# Metacharacter classes
vowels = '[aeiou]'

print(f"Scanning `{input_string}` for vowels, `{vowels}`:")
for match_vowel in re.finditer(vowels, input_string):
    print(match_vowel)

Scanning `The quick brown fox jumps over the lazy dog` for vowels, `[aeiou]`:
<re.Match object; span=(2, 3), match='e'>
<re.Match object; span=(5, 6), match='u'>
<re.Match object; span=(6, 7), match='i'>
<re.Match object; span=(12, 13), match='o'>
<re.Match object; span=(17, 18), match='o'>
<re.Match object; span=(21, 22), match='u'>
<re.Match object; span=(26, 27), match='o'>
<re.Match object; span=(28, 29), match='e'>
<re.Match object; span=(33, 34), match='e'>
<re.Match object; span=(36, 37), match='a'>
<re.Match object; span=(41, 42), match='o'>


The regular expression `[aeiou]` is a character class that matches any single character that is one of the lowercase vowels: 'a', 'e', 'i', 'o', or 'u'. 

Here's what each part of the regular expression does:

- `[ ]`: Square brackets denote a character class, which is a set of characters. In this case, it's a character class containing the vowels.

- `aeiou`: Inside the square brackets, the characters 'a', 'e', 'i', 'o', and 'u' are listed. This means the regular expression will match any single character that is one of these vowels.

For example, if you apply the regular expression `[aeiou]` to the text "Hello," it will match the 'e' in "Hello" because 'e' is one of the vowels listed in the character class.

In [21]:
# Counts: For instance, two or more consecutive vowels:
two_or_more_vowels = vowels + '{2,}'
print(f"Pattern: {two_or_more_vowels}")
print(re.findall(two_or_more_vowels, input_string))

Pattern: [aeiou]{2,}
['ui']


The regular expression `[aeiou]{2,}` matches runs of two or more consecutive lowercase vowels (i.e., 'a', 'e', 'i', 'o', or 'u'). The vowels in a run need not be the same letter.

Here's what each part of the regular expression does:

- `[aeiou]`: This is a character class that matches any single lowercase vowel.

- `{2,}`: This is a quantifier that specifies "two or more" of the preceding pattern. Applied to `[aeiou]`, it matches a run of at least two consecutive lowercase vowels. Like other quantifiers, it is greedy by default and extends the run as far as possible.

So, for example, in the text "ooaaiiee," the regular expression `[aeiou]{2,}` produces a single match, "ooaaiiee," because the whole string is one unbroken run of vowels. In "The quick brown fox," the only match is "ui."
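
A quick way to see how `[aeiou]{2,}` groups vowels is to run it on a few inputs. The sketch below uses made-up strings; note that an unbroken run of vowels comes back as a single match, while separators split it into several.

```python
import re

vowel_runs = re.compile(r"[aeiou]{2,}")

unbroken = vowel_runs.findall("ooaaiiee")        # one long run of vowels
separated = vowel_runs.findall("oo-aa-i-ee")     # dashes break the runs; the lone 'i' is too short
sentence = vowel_runs.findall("The quick brown fox")

print(unbroken, separated, sentence, sep="\n")
```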

---

In [22]:
# Repetition: `+` means "one or more" of the preceding character
cats = "ca+t"
print(re.search(cats, "is this a ct?"))
print(re.search(cats, "how about this cat?"))
print(re.search(cats, "and this one: caaaaat, yes or no?"))

None
<re.Match object; span=(15, 18), match='cat'>
<re.Match object; span=(14, 21), match='caaaaat'>


The regular expression `"ca+t"` is a pattern that matches any substring in which:

- `"c"` appears once.
- `"a"` appears one or more times immediately after the `"c"`.
- `"t"` appears immediately after the last `"a"`.

Here's how the components of the regular expression work:

- `"c"`: This matches the character "c" literally.

- `"a+"`: This part matches one or more occurrences of the character "a." The `"+"` is a quantifier that means "one or more." So, `"a+"` matches one or more consecutive "a" characters.

- `"t"`: This matches the character "t" literally.

Putting it all together, `"ca+t"` matches substrings like "cat," "caat," "caaat," and so on. Because the pattern is not anchored, `re.search()` finds such a substring anywhere in the input. Examples of matches include:

- "cat"
- "caat"
- "caaat"

Examples of non-matches include:

- "ct" (no "a" after "c")
- "cot" ("o" instead of "a" after "c")
- "cazat" (a non-"a" character between the "a" and the "t")

In summary, `"ca+t"` is a regular expression that matches a "c," followed by one or more "a" characters, followed by a "t."
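
The behavior of `ca+t` can be verified directly. This sketch runs `search()` over a handful of illustrative words (chosen here, not taken from the exercise data) and records which ones contain a match.

```python
import re

cats = re.compile(r"ca+t")

# Illustrative inputs only.
words = ["ct", "cat", "caaat", "cot", "cazat", "concat", "cats"]

# search() looks for the pattern anywhere in the word, not just at the start.
results = {w: bool(cats.search(w)) for w in words}

for word, matched in results.items():
    print(f"{word!r}: {matched}")
```

Note that "concat" and "cats" both match, since the pattern is unanchored and only needs to appear somewhere in the input.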

---

In [23]:
# Special operator: "or"
adjectives = "lazy|brown"
print(f"Scanning `{input_string}` for adjectives, `{adjectives}`:")
for match_adjective in re.finditer(adjectives, input_string):
    print(match_adjective)

Scanning `The quick brown fox jumps over the lazy dog` for adjectives, `lazy|brown`:
<re.Match object; span=(10, 15), match='brown'>
<re.Match object; span=(35, 39), match='lazy'>


The regular expression `"lazy|brown"` is a pattern that matches either the text "lazy" or the text "brown." The `|` character is used as an OR operator within the regular expression, allowing it to match either of the specified alternatives.

Here's how it works:

- `"lazy"`: This alternative matches the literal characters "lazy." If "lazy" appears anywhere in the text, even inside a longer word, it's considered a match.

- `"brown"`: This alternative matches the literal characters "brown." If "brown" appears anywhere in the text, even inside a longer word such as "brownie," it's considered a match.

When you use the `|` character between these two alternatives, the regular expression will match either "lazy" or "brown" wherever either of them appears in the text.

For example, in the text "The quick brown fox is lazy," the regular expression `"lazy|brown"` will match both "brown" and "lazy" because they both appear in the text.
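
Alternation on its own matches substrings, not whole words. If whole-word matching is wanted, one option is to wrap the alternatives in a non-capturing group and surround it with `\b` anchors. A minimal sketch with a made-up sentence:

```python
import re

loose = re.compile(r"lazy|brown")             # matches the text anywhere
strict = re.compile(r"\b(?:lazy|brown)\b")    # matches only as a standalone word

sample = "a brownie for the lazy fox"  # illustrative input

loose_hits = loose.findall(sample)
strict_hits = strict.findall(sample)

print(loose_hits, strict_hits, sep="\n")
```

The group is needed because `|` has very low precedence: without it, `\blazy|brown\b` would mean "`\blazy` or `brown\b`."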

---

In [24]:
# Predefined character classes
three_digits = '\d\d\d'
print(re.findall(three_digits, "My number is 555-123-4567"))

['555', '123', '456']


The regular expression `'\d\d\d'` is a pattern that matches three consecutive digits (0-9) in a string. Here's what each part of the regular expression does:

- `\d`: This is a shorthand character class that matches any digit (0-9).

- `\d\d\d`: This pattern specifies three consecutive `\d` expressions, which means it matches three digits in a row.

So, `'\d\d\d'` will match any substring in a text that consists of three consecutive digits. For example, in the string "The code is 12345," it will match the "123" portion of the string, because that is the first sequence of three digits.

> In the previous example, notice that the pattern search proceeds from left-to-right and does not return overlaps: here, the matcher returns `456` but not `567`. This happens because `findall()` returns non-overlapping matches, resuming its scan immediately after the end of the previous match. (This is related to, but distinct from, the default [_greedy behavior_](https://docs.python.org/3/howto/regex.html#greedy-versus-non-greedy) of quantifiers such as `*` and `+`.)
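
If overlapping matches are actually wanted, one common workaround (not used elsewhere in this notebook) is to put the pattern inside a lookahead with a capture group. The lookahead consumes no characters, so the scan advances one position at a time. A sketch under that assumption:

```python
import re

phone = "My number is 555-123-4567"

# Standard scan: each match consumes its characters, so '567' is never seen.
non_overlapping = re.findall(r"\d\d\d", phone)

# Lookahead variant: the match itself is empty; the group captures the three digits.
overlapping = re.findall(r"(?=(\d\d\d))", phone)

print(non_overlapping, overlapping, sep="\n")
```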

---

In regular expressions (regex), metacharacters and special sequences are characters or short escape codes with a special meaning. Many of them stand for predefined sets of characters, which provides a convenient way to match commonly used character groups; others mark positions or combine alternatives. Here are some of the most commonly used ones:

1. `\d`: Matches any digit (0-9).
2. `\D`: Matches any character that is not a digit.
3. `\w`: Matches any word character (alphanumeric character plus underscore `_`).
4. `\W`: Matches any character that is not a word character.
5. `\s`: Matches any whitespace character (including spaces, tabs, and line breaks).
6. `\S`: Matches any character that is not a whitespace character.
7. `\b`: Matches a word boundary, ensuring that a word is not part of a longer word.
8. `\B`: Matches a position that is not a word boundary.
9. `.` (dot): Matches any character except a newline character.
10. `[...]`: A character class that matches any single character from the specified set. For example, `[aeiou]` matches any vowel.
11. `[^...]`: A negated character class that matches any single character not in the specified set. For example, `[^0-9]` matches any character that is not a digit.
12. `|` (pipe): An OR operator that allows you to specify multiple alternatives. For example, `cat|dog` matches either "cat" or "dog."

These constructs can be combined with one another and with other regular expression elements to create more complex patterns for text matching. They provide a powerful way to search for and manipulate text based on patterns and character classes.
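
To make a few entries from the list above concrete, here is a short sketch that applies them to a made-up string containing digits, an underscore, a comma, and a tab:

```python
import re

sample = "Room 42, Bldg_7\tfloor 3"  # illustrative input

digits = re.findall(r"\d+", sample)          # runs of digits
words = re.findall(r"\w+", sample)           # runs of word characters (letters, digits, underscore)
fields = re.split(r"\s+", sample)            # split on any whitespace, including the tab
only_digits = re.sub(r"[^0-9]", "", sample)  # negated class: delete everything that is not a digit

print(digits, words, fields, only_digits, sep="\n")
```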

---

**The backslash plague.** In the "three-digits" example, we used the predefined metacharacter class, `'\d'`, to match digits. But what if you want to match a _literal_ backslash? The HOWTO describes how things can get out of control in its subsection on ["The Backslash Plague"](https://docs.python.org/3/howto/regex.html#the-backslash-plague), which occurs because the Python interpreter processes backslashes in string literals (e.g., so that `\t` expands to a tab character and `\n` to a newline) while the regular expression processor also gives backslashes meaning (e.g., so that `\d` is a digit character class).

For example, suppose you want to look for the text string, `\section`, in some input string. Which of the following will match it? Recall that `\s` is a predefined metacharacter class that matches any whitespace character.

In [25]:
input_with_slash_section = "This string contains `\section`, which we would like to match."

print(f"Searching: {input_with_slash_section}")

print(re.search("\section", input_with_slash_section))
print(re.search("\\section", input_with_slash_section))
print(re.search("\\\\section", input_with_slash_section))

Searching: This string contains `\section`, which we would like to match.
None
None
<re.Match object; span=(22, 30), match='\\section'>


To help mitigate this case, Python provides a special type of string called a _raw string_, which is a string literal prefixed by the letter `r`. For such strings, the Python interpreter will not process the backslash.

> Although the interpreter won't process the backslash, the regular expression processor will do so. As such, the pattern string still needs _two_ backslashes, as shown below.

In [26]:
print(re.search(r"\section", input_with_slash_section))
print(re.search(r"\\section", input_with_slash_section))
print(re.search(r"\\\\section", input_with_slash_section))

None
<re.Match object; span=(22, 30), match='\\section'>
None


Indeed, it is common style to always use raw strings for regular expression patterns, as we'll do in the examples that follow.
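
It can help to look at what each string literal actually contains before the regular expression processor sees it. The sketch below compares lengths and also uses `re.escape()`, which builds a pattern that matches a given piece of text literally. The variable names and sample text are illustrative.

```python
import re

plain = "\\section"   # regular literal: the interpreter collapses \\ into one backslash
raw = r"\\section"    # raw literal: both backslashes are kept

print(len(plain), len(raw))

# re.escape() adds the backslashes needed to match its argument literally.
escaped = re.escape(r"\section")

target = r"find \section here"
hit = re.search(escaped, target)
print(escaped, hit)
```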

---

**Creating pattern groups.** Another handy construct is the [_pattern group_](https://docs.python.org/3/howto/regex.html#grouping), as we show in the next code cell.

Suppose we have a string that we know contains a name of the form, "(first) (middle) (last)", where the middle name is _optional_. We can use pattern groups to isolate each component of the name and tag the middle name as optional using the "zero-or-one" metacharacter, `'?'`.

The group itself is a subpattern enclosed within parentheses. When a match is found, we can extract the groups by calling `.groups()` on the match object, which returns a tuple of all matched groups.

> To make this pattern more readable, we have also used Python's multiline string literal combined with the [`re.VERBOSE` option](https://docs.python.org/3/library/re.html#re.VERBOSE), which then allows us to include whitespace and comments as part of the pattern string.

In [27]:
# Make the expression more readable with a re.VERBOSE pattern
re_names2 = re.compile(r'''^              # Beginning of string
                           ([a-zA-Z]+)    # First name
                           \s+            # At least one space
                           ([a-zA-Z]+\s)? # Optional middle name
                           ([a-zA-Z]+)    # Last name
                           $              # End of string
                        ''',
                        re.VERBOSE)
print(re_names2.match('Rich Vuduc').groups())
print(re_names2.match('Rich S Vuduc').groups())
print(re_names2.match('Rich Salamander Vuduc').groups())

('Rich', None, 'Vuduc')
('Rich', 'S ', 'Vuduc')
('Rich', 'Salamander ', 'Vuduc')


The regular expression `[a-zA-Z]+` is a pattern that matches one or more consecutive alphabetical characters (letters) in a string, regardless of whether they are uppercase or lowercase. Here's how it works:

- `[a-zA-Z]`: This is a character class that matches any single alphabetical character. The range `a-z` represents lowercase letters, and the range `A-Z` represents uppercase letters.

- `+`: This is a quantifier that specifies "one or more" of the preceding pattern. In this case, it means one or more consecutive alphabetical characters.

So, `[a-zA-Z]+` will match any substring in a text that consists of one or more consecutive alphabetical characters (letters). Because the character class lists both ranges explicitly, it matches lowercase and uppercase letters alike.

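Examples of strings that the pattern matches in full:
- "Hello"
- "World"
- "abc"
- "XyZ"
- "aBcD"

Examples of strings that the pattern does not match in full:
- "12345" (no letters)
- "word123" (only the prefix "word" matches)
- "!" (no letters)
- "A B" (the space splits it into two separate matches, "A" and "B")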

The regular expression `([a-zA-Z]+\s)?` is a pattern that optionally matches a sequence of one or more consecutive alphabetical characters (letters) followed by a single whitespace character. Here's how it works:

- `([a-zA-Z]+\s)?`: This is an expression enclosed in parentheses and followed by a `?`. 

  - `[a-zA-Z]+`: This part matches one or more consecutive alphabetical characters (letters), as explained earlier.

  - `\s`: This matches a single whitespace character (e.g., space, tab, newline).

  - `?`: The `?` is a quantifier that makes the preceding expression `([a-zA-Z]+\s)` optional, meaning it can occur zero or one time.

So, the entire pattern `([a-zA-Z]+\s)?` will match:

- A sequence of one or more letters followed by a whitespace character (e.g., "Hello ").
- Nothing at all (the empty string), when the letters are not followed by whitespace. For example, against "World" the group does not participate in the match, and `.groups()` reports it as `None`.

This pattern is useful for capturing an optional word together with the whitespace that separates it from the next word, as with the optional middle name above.
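
The optional group can be tried in isolation to see what it captures when the trailing whitespace is present and when it is absent. A minimal sketch with two made-up inputs:

```python
import re

optional_word = re.compile(r"([a-zA-Z]+\s)?")

with_space = optional_word.match("Hello world")  # letters followed by a space
no_space = optional_word.match("World")          # letters with nothing after them

print(repr(with_space.group(1)))
print(repr(no_space.group(0)), no_space.group(1))
```

In the second case the match still succeeds, but it is empty and the group is reported as `None`, because the required `\s` inside the group has nothing to match.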

---

**Tagging pattern groups.** You can also name pattern groups, which helps make your extraction code a bit more readable.

In [28]:
# Named groups
re_names3 = re.compile(r'''^
                           (?P<first>[a-zA-Z]+)
                           \s
                           (?P<middle>[a-zA-Z]+\s)?
                           \s*
                           (?P<last>[a-zA-Z]+)
                           $
                        ''',
                        re.VERBOSE)
print(re_names3.match('Rich Vuduc').group('first'))
print(re_names3.match('Rich S Vuduc').group('middle'))
print(re_names3.match('Rich Salamander Vuduc').group('last'))

Rich
S 
Vuduc
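
Named groups also work with `.groupdict()`, which returns all of them at once as a dictionary. The sketch below uses a slight variation on the pattern above (an assumption made for this example, not the notebook's own pattern): the whitespace after the middle name sits outside the named group, so the captured middle name has no trailing space.

```python
import re

# Variation for illustration: the separator after the middle name is
# kept out of the <middle> group by an enclosing non-capturing group.
name_re = re.compile(r"""^
    (?P<first>[a-zA-Z]+)
    \s+
    (?:(?P<middle>[a-zA-Z]+)\s+)?
    (?P<last>[a-zA-Z]+)
    $""", re.VERBOSE)

two_part = name_re.match("Rich Vuduc").groupdict(default="")
three_part = name_re.match("Rich S Vuduc").groupdict()

print(two_part)
print(three_part)
```

The `default` argument replaces the `None` that would otherwise be reported for a group that did not participate in the match.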


**A regular expression debugger.** Regular expressions can be tough to write and debug, but thankfully, there are several online tools to help! See, for instance, [regex101](https://regex101.com/), [pythex](https://pythex.org/), [regexr](https://regexr.com/), or [debuggex](https://www.debuggex.com/). These all allow you to supply some sample input text and test what your pattern does in real time.

---

## Email addresses

In the next exercise, you'll apply what you've read and learned about regular expressions to build a pattern matcher for email addresses. Again, if you haven't looked through the HOWTO yet, take a moment to do that!

Although there is a [formal specification of what constitutes a valid email address](https://tools.ietf.org/html/rfc5322#section-3.4.1), for this exercise, let's use the following simplified rules.

* We will restrict our attention to ASCII addresses and ignore Unicode. If you don't know what that means, don't worry about it---you shouldn't need to do anything special given our code templates, below.
* An email address has two parts, the username and the domain name. These are separated by an `@` character.
* A username **must begin with an alphabetic** character. It may be followed by any number of additional _alphanumeric_ characters or any of the following special characters: `.` (period), `-` (hyphen), `_` (underscore), or `+` (plus).
* A domain name **must end with an alphabetic** character. It may consist of any of the following characters: alphanumeric characters, `.` (period), `-` (hyphen), or `_` (underscore).
* Alphabetic characters may be uppercase or lowercase.
* No whitespace characters are allowed.

Valid domain names usually have additional restrictions, e.g., there are a limited number of endings, such as `.com`, `.edu`, and so on. However, for this exercise you may ignore this fact.

---

**Exercise 1** (2 points). Write a function `parse_email` that, given an email address `s`, returns a tuple, `(user-id, domain)` corresponding to the user name and domain name.

For instance, given `richie@cc.gatech.edu` it should return `('richie', 'cc.gatech.edu')`.

Your function should parse the email only if it exactly matches the email specification. For example, if there are leading or trailing spaces, the function should *not* match those. See the test cases for examples.

If the input is not a valid email address, the function should raise a `ValueError`.

> The requirement, "raise a `ValueError`" refers to a technique for handling errors in a program known as _exception handling_. The Python documentation covers [exceptions](https://docs.python.org/3/tutorial/errors.html) in more detail, including [raising `ValueError` objects](https://docs.python.org/3/tutorial/errors.html#raising-exceptions).
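
As a reminder of the mechanics, here is a minimal sketch of raising and catching a `ValueError`. The function `parse_nonneg_int` is a hypothetical helper invented for this illustration and is unrelated to the exercise solution.

```python
def parse_nonneg_int(s):
    """Hypothetical helper: convert a string of digits to an int, or raise ValueError."""
    if not (s.isascii() and s.isdigit()):
        raise ValueError(f"not a non-negative integer: {s!r}")
    return int(s)

print(parse_nonneg_int("42"))

# The caller decides what to do when the input is rejected.
try:
    parse_nonneg_int("12a")
except ValueError as err:
    message = str(err)
    print("Rejected:", message)
```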

We have provided a wrapper function, `eif_wrapper`. This wrapper will capture whether or not your function raised a `ValueError` and the returned value from the function. Additionally, any `ValueError` raised by your function will not halt execution of the notebook.

`eif_wrapper` is provided for you in the cell below. The function inputs are `s`, the input to be evaluated, and `func` the function that can raise a `ValueError`.

In [29]:
# ValueError wrapper
def eif_wrapper(s, func):
    """
    Returns a (bool, function return) pair. The first element is True when `func(s)` raises a
    ValueError and False otherwise. The second element is the return value of `func(s)`, or
    None if a ValueError was raised.
    """
    raised_value_error = False
    result = None
    try:
        result = func(s)
    except ValueError:
        raised_value_error = True
    # Return outside of `finally` so that exceptions other than ValueError still propagate
    return (raised_value_error, result)

In [30]:
### Define demo inputs
demo_str_ex1_list = ['richie@cc.gatech.edu',
                     'what-do-you-know+not-much@gmail.com',
                     'x @hpcgarage.org',
                     'richie@cc.gatech.edu7']

<!-- Expected demo output text block -->
The demo included in the solution cell below should display the following output:
```
eif_wrapper('richie@cc.gatech.edu', parse_email) -> (False, ('richie', 'cc.gatech.edu'))
eif_wrapper('what-do-you-know+not-much@gmail.com', parse_email) -> (False, ('what-do-you-know+not-much', 'gmail.com'))
eif_wrapper('x @hpcgarage.org', parse_email) -> (True, None)
eif_wrapper('richie@cc.gatech.edu7', parse_email) -> (True, None)
```
<!-- Include any shout outs here -->

In [31]:
### Exercise 1 solution
def parse_email(s):

    valid_email = re.compile(r'''
                                ([a-zA-Z]              # username: starts with a letter,
                                 [a-zA-Z0-9.\-_+]*)    # then any number of alphanumerics or . - _ +
                                @                      # the @ separator
                                ([a-zA-Z0-9.\-_]*      # domain: any number of alphanumerics or . - _
                                 [a-zA-Z])             # and it must end with a letter
                             ''',
                             re.VERBOSE)
    
    # fullmatch() requires the whole string to match, so a trailing newline is rejected too
    match = valid_email.fullmatch(s)
    
    if match:
        # The two capture groups are the username and the domain
        return match.groups()
    else:
        raise ValueError("Invalid email address")

In [32]:
### demo function call
for demo_str_ex1 in demo_str_ex1_list:
    print(f"eif_wrapper({demo_str_ex1!r}, parse_email) -> {eif_wrapper(demo_str_ex1, parse_email)}")

eif_wrapper('richie@cc.gatech.edu', parse_email) -> (False, ('richie', 'cc.gatech.edu'))
eif_wrapper('what-do-you-know+not-much@gmail.com', parse_email) -> (False, ('what-do-you-know+not-much', 'gmail.com'))
eif_wrapper('x @hpcgarage.org', parse_email) -> (True, None)
eif_wrapper('richie@cc.gatech.edu7', parse_email) -> (True, None)


<!-- Test Cell Boilerplate -->
The cell below will test your solution for Exercise 1. The testing variables will be available for debugging under the following names in a dictionary format.
- `input_vars` - Input variables for your solution. 
- `original_input_vars` - Copy of input variables from prior to running your solution. These _should_ be the same as `input_vars` - otherwise the inputs were modified by your solution.
- `returned_output_vars` - Outputs returned by your solution.
- `true_output_vars` - The expected output. This _should_ "match" `returned_output_vars` based on the question requirements - otherwise, your solution is not returning the correct output. 

In [33]:
### test_cell_ex1
from tester_fw.testers import Tester

conf = {
    'case_file':'tc_1', 
    'func': lambda s: eif_wrapper(s, parse_email), # replace this with the function defined above
    'inputs':{ # input config dict. keys are parameter names
        's':{
            'dtype':'str', # data type of param.
            'check_modified':False,
        }
    },
    'outputs':{
        'output_0':{
            'index':0,
            'dtype':'bool',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': True, # Ignored if dtype is not df
            'check_row_order': True, # Ignored if dtype is not df
            'check_column_type': True, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        },
        'output_1':{
            'index':1,
            'dtype':'NoneType',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': True, # Ignored if dtype is not df
            'check_row_order': True, # Ignored if dtype is not df
            'check_column_type': True, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'F24oYNyh8kq_wkZD_Oo0ZCPHLcoO-xNXHOYNiPnQmmY=', path='resource/asnlib/publicdata/')
for _ in range(70):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

print('Passed! Please submit.')

Passed! Please submit.


---

## Phone numbers

**Exercise 2** (2 points). Write a function to parse US phone numbers written in the canonical "(404) 555-1212" format, i.e., a three-digit area code enclosed in parentheses, followed by a seven-digit local number written as three digits, a hyphen, and four digits. It should also **ignore** all leading and trailing spaces, as well as any spaces that appear between the area code and the local number. However, it should **not** accept any spaces inside the area code (e.g., in '(404)') or inside the seven-digit local number.

For example, these would be considered valid phone number strings:
```python
    '(404) 121-2121'
    '(404)121-2121     '
    '   (404)      121-2121'
```

By contrast, these should be rejected:
```python
    '404-121-2121'
    '(404)555 -1212'
    ' ( 404)121-2121'
    '(abc) 555-12i2'
```

It should return a triple of strings, `(area_code, first_three, last_four)`. 

If the input is not a valid phone number, it should raise a `ValueError`.

The same wrapper function `eif_wrapper` from Exercise 1 will be used for evaluation.

In [34]:
### Define demo inputs
demo_str_ex2_list = ['(404) 121-2121',
                     '404-121-2121']

<!-- Expected demo output text block -->
The demo included in the solution cell below should display the following output:
```
eif_wrapper('(404) 121-2121', parse_phone1) -> (False, ('404', '121', '2121'))
eif_wrapper('404-121-2121', parse_phone1) -> (True, None)
```
<!-- Include any shout outs here -->
**Recall that `eif_wrapper` returns two outputs**:
- `bool` indicating whether or not your function raised a `ValueError`
- `tuple` (the result of your parsing function) or `None` (when the parsing function raises an error)
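
The definition of `eif_wrapper` is not repeated in this section. As a rough mental model only, a wrapper with the behavior described above could look like the sketch below. The names `eif_wrapper_sketch` and `toy_parser` are hypothetical, and the real helper may differ in its details.

```python
def eif_wrapper_sketch(s, func):
    """Hypothetical stand-in for eif_wrapper: returns (raised, result)."""
    try:
        result = func(s)
    except ValueError:
        # The parsing function rejected the input
        return (True, None)
    return (False, result)


def toy_parser(s):
    """Toy parser used only for this illustration: accepts digit-only strings."""
    if not s.isdigit():
        raise ValueError("expected digits only")
    return (s,)


print(eif_wrapper_sketch("404", toy_parser))  # (False, ('404',))
print(eif_wrapper_sketch("4o4", toy_parser))  # (True, None)
```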

In [35]:
valid_number = re.compile(r'^\(\d{3}\)\s*\d{3}-\d{4}$')

In [36]:
s = '(404) 121-2121'

In [37]:
match = valid_number.match(s)

In [38]:
match.group(0)

'(404) 121-2121'

In [39]:
match.group(0).split('-')

['(404) 121', '2121']

In [40]:
pattern = re.compile(r'\((\d{3})\)\s*(\d{3})-(\d{4})')

In [41]:
match = re.search(pattern, s)

In [42]:
match.group(0)

'(404) 121-2121'

In [43]:
match.group(1)

'404'

In [44]:
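def parse_phone1(s):
    # Raw string avoids invalid-escape warnings; the anchors ensure that only
    # leading and trailing whitespace is tolerated around the number.
    valid_number = re.compile(r'^\s*\((\d{3})\)\s*(\d{3})-(\d{4})\s*$')

    match = valid_number.match(s)

    if match:
        return (match.group(1), match.group(2), match.group(3))
    else:
        raise ValueError("Invalid number")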

In [45]:
### demo function call
for demo_str_ex2 in demo_str_ex2_list:
    print(f"eif_wrapper('{demo_str_ex2}', parse_phone1) -> {eif_wrapper(demo_str_ex2, parse_phone1)}")

eif_wrapper('(404) 121-2121', parse_phone1) -> (False, ('404', '121', '2121'))
eif_wrapper('404-121-2121', parse_phone1) -> (True, None)
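
One way to double-check a parser against the examples in the exercise statement is to loop over the valid and invalid strings. The sketch below is self-contained and uses a hypothetical `parse_phone_strict` with `fullmatch`, so it is an illustration under those assumptions rather than the graded solution. The last invalid string is an added example: a search with an unanchored pattern would accept it, because the number appears in the middle of other text.

```python
import re

# Optional spaces, "(XXX)", optional spaces, "XXX-XXXX", optional spaces
STRICT_PHONE = re.compile(r' *\((\d{3})\) *(\d{3})-(\d{4}) *')


def parse_phone_strict(s):
    """Illustrative strict parser; raises ValueError for any other format."""
    match = STRICT_PHONE.fullmatch(s)  # the whole string must fit the pattern
    if match is None:
        raise ValueError(f"Invalid number: {s!r}")
    return match.groups()


valid = ['(404) 121-2121', '(404)121-2121     ', '   (404)      121-2121']
invalid = ['404-121-2121', '(404)555 -1212', ' ( 404)121-2121',
           '(abc) 555-12i2', 'call (404) 121-2121 now']

for s in valid:
    assert parse_phone_strict(s) == ('404', '121', '2121')

for s in invalid:
    try:
        parse_phone_strict(s)
    except ValueError:
        pass  # rejected, as expected
    else:
        raise AssertionError(f"{s!r} should have been rejected")

print("All examples behave as described.")
```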


<!-- Test Cell Boilerplate -->
The cell below will test your solution for Exercise 2. The testing variables will be available for debugging under the following names in a dictionary format.
- `input_vars` - Input variables for your solution. 
- `original_input_vars` - Copy of input variables from prior to running your solution. These _should_ be the same as `input_vars` - otherwise the inputs were modified by your solution.
- `returned_output_vars` - Outputs returned by your solution.
- `true_output_vars` - The expected output. This _should_ "match" `returned_output_vars` based on the question requirements - otherwise, your solution is not returning the correct output. 

In [46]:
### test_cell_ex2
from tester_fw.testers import Tester

conf = {
    'case_file':'tc_2', 
    'func': lambda s: eif_wrapper(s, parse_phone1), # replace this with the function defined above
    'inputs':{ # input config dict. keys are parameter names
        's':{
            'dtype':'str', # data type of param.
            'check_modified':False,
        }
    },
    'outputs':{
        'output_0':{
            'index':0,
            'dtype':'bool',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': True, # Ignored if dtype is not df
            'check_row_order': True, # Ignored if dtype is not df
            'check_column_type': True, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        },
        'output_1':{
            'index':1,
            'dtype':'NoneType',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': True, # Ignored if dtype is not df
            'check_row_order': True, # Ignored if dtype is not df
            'check_column_type': True, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        }
    }
}
tester = Tester(conf, key=b'F24oYNyh8kq_wkZD_Oo0ZCPHLcoO-xNXHOYNiPnQmmY=', path='resource/asnlib/publicdata/')
for _ in range(70):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

print('Passed! Please submit.')

Passed! Please submit.


---

**Exercise 3** (3 points). Implement an enhanced phone number parser that can handle any of these patterns.

* (404) 555-1212
* (404) 5551212
* 404-555-1212
* 404-5551212
* 404555-1212
* 4045551212

As seen in the examples above, this parser should also handle the cases in which the area code is not enclosed in parentheses. As before, it should not be sensitive to leading or trailing spaces. Also, for the patterns in which the area code is enclosed in parentheses, it should not be sensitive to the number of spaces separating the area code from the remainder of the number.

The same wrapper function `eif_wrapper` from Exercise 1 will be used for evaluation.

In [47]:
### Define demo inputs
demo_str_ex3_list = ['404-5551212',
                     '(404-555-1212']

<!-- Expected demo output text block -->
The demo included in the solution cell below should display the following output:
```
eif_wrapper('404-5551212', parse_phone2) -> (False, ('404', '555', '1212'))
eif_wrapper('(404-555-1212', parse_phone2) -> (True, None)
```
<!-- Include any shout outs here -->
**Recall that `eif_wrapper` returns two outputs**:
- `bool` indicating whether or not your function raised a `ValueError`
- `tuple` (the result of your parsing function) or `None` (when the parsing function raises an error)

In [48]:
pattern = r'''
            (\(\d{3}\)|\d{3})  # Match either (XXX) or XXX
            
            [\s-]?                  # optional whitespace or hyphen
            
            (\d{3})
            
            [\s-]?                   # optional whitespace or hyphen
            
            (\d{4})
            
           
            '''

valid_number = re.compile(pattern, re.VERBOSE)

# Test the pattern
phone1 = '(404) 555-1212'
phone2 = '404 555-1212'
phone3 = '(404-555-1212'
phone4 = '404) 555-1212'

print(valid_number.match(phone1))  # Should match
print(valid_number.match(phone2))  # Should match
print(valid_number.match(phone3))  # Should not match
print(valid_number.match(phone4))  # Should not match


<re.Match object; span=(0, 14), match='(404) 555-1212'>
<re.Match object; span=(0, 12), match='404 555-1212'>
None
None
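
Note that `match` only anchors the pattern at the start of the string, so a pattern without an end anchor still accepts inputs with trailing junk. The sketch below, using the hypothetical names `unanchored` and `too_long`, contrasts `match` with `fullmatch` on a compact version of the pattern above.

```python
import re

# Compact, unanchored version of the exploratory pattern
unanchored = re.compile(r'(\(\d{3}\)|\d{3})[\s-]?(\d{3})[\s-]?(\d{4})')

too_long = '404 555-12129999'  # four extra digits after a valid-looking number

print(unanchored.match(too_long) is not None)            # True: the prefix matches
print(unanchored.fullmatch(too_long) is not None)        # False: the whole string must match
print(unanchored.fullmatch('404 555-1212') is not None)  # True
```

Adding `^` and `$` to the pattern, or calling `fullmatch`, closes this gap.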


---

In [156]:
pattern = r'''
            ^\s*               # Leading spaces

            (    
            \(\d{3}\)\s*       # "(XXX) "  
            |                  # OR
            \d{3}-?            # "xxx" or "xxx-"
            )  
            


            
            (\d{3})             # Match XXX
            
            [\s-]?               # optional whitespace or hyphen
            
            (\d{4})            # Match XXXX
            
            \s*$       # Trailing spaces

            
            '''

valid_number = re.compile(pattern, re.VERBOSE)

---

In [142]:
phone1 = '4045551212'

In [143]:
match = re.search(valid_number, phone1)

In [144]:
match.group(0)

'4045551212'

In [145]:
match.group(1)

'404'

In [146]:
match.group(2)

'555'

In [147]:
match.group(3)

'1212'

---

In [148]:
phone2 = '404-5551212'

In [149]:
match = re.search(valid_number, phone2)

In [150]:
match.group(0)

'404-5551212'

In [151]:
match.group(1)

'404-'

In [152]:
match.group(2)

'555'

---

In [153]:
phone3 = '(404-555-1212'

In [154]:
match = re.search(valid_number, phone3)

In [155]:
# `re.search` returned None because the opening parenthesis is never closed,
# so calling `match.group(...)` here would raise an AttributeError.
print(match)

None

---

In [157]:
phone4 = '(404)555-1212'

In [158]:
match = re.search(valid_number, phone4)

In [159]:
match.group(0)

'(404)555-1212'

In [160]:
match.group(1)

'(404)'

In [161]:
match.group(2)

'555'

In [162]:
match.group(3)

'1212'

---

In [170]:
def parse_phone2(s):
    pattern = r'''
            ^\s*               # Leading spaces
            (
            \(\d{3}\)\s*       # "(XXX) ", with any number of spaces after ")"
            |                  # OR
            \d{3}-?            # "XXX" or "XXX-"
            )
            (\d{3})            # Match XXX
            -?                 # Optional hyphen
            (\d{4})            # Match XXXX
            \s*$               # Trailing spaces
            '''

    valid_number = re.compile(pattern, re.VERBOSE)

    match = valid_number.match(s)

    # `match` is None when the input does not fit the pattern
    if match is None:
        raise ValueError(f"'{s}' is not a valid phone number")

    # Group 1 may contain parentheses, spaces, or a hyphen; keep only the digits
    area_code = re.sub(r'\D', '', match.group(1))

    return (area_code, match.group(2), match.group(3))

In [172]:
def parse_phone2(s):
    pattern = r'''
        ^\s*               # Leading spaces
        (?P<areacode>
        \d{3}-?            # "xxx" or "xxx-"
        | \(\d{3}\)\s*     # OR "(xxx) "
        )
        (?P<prefix>\d{3})  # xxx
        -?                 # Dash (optional)
        (?P<suffix>\d{4})  # xxxx
        \s*$               # Trailing spaces
    '''
    matcher = re.compile(pattern, re.VERBOSE)
    matches = matcher.match(s)
    if matches is None:
        raise ValueError("'{}' is not in the right format.".format(s))
    # The area-code group may contain parentheses, spaces, or a hyphen; extract the digits
    areacode = re.search(r'\d{3}', matches.group('areacode')).group()
    prefix = matches.group('prefix')
    suffix = matches.group('suffix')
    return (areacode, prefix, suffix)

In [110]:
### demo function call
for demo_str_ex3 in demo_str_ex3_list:
    print(f"eif_wrapper('{demo_str_ex3}', parse_phone2) -> {eif_wrapper(demo_str_ex3, parse_phone2)}")

eif_wrapper('404-5551212', parse_phone2) -> (False, ('404', '555', '1212'))
eif_wrapper('(404-555-1212', parse_phone2) -> (True, None)
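
As a further check, the six accepted formats from the exercise statement can be run through a parser in a loop. The sketch below is a self-contained alternative, not the graded solution: it gives each area-code form its own named group so that no post-processing of parentheses or hyphens is needed. The names `PHONE_PATTERN`, `parse_phone_sketch`, `area_paren`, and `area_plain` are hypothetical.

```python
import re

PHONE_PATTERN = re.compile(r'''
    \s*                                     # leading spaces
    (?: \( (?P<area_paren>\d{3}) \) \s*     # "(XXX)" followed by any number of spaces
      | (?P<area_plain>\d{3}) -?            # or "XXX" with an optional hyphen
    )
    (?P<prefix>\d{3}) -? (?P<suffix>\d{4})  # "XXX", optional hyphen, "XXXX"
    \s*                                     # trailing spaces
''', re.VERBOSE)


def parse_phone_sketch(s):
    """Illustrative parser: return (area, prefix, suffix) or raise ValueError."""
    match = PHONE_PATTERN.fullmatch(s)
    if match is None:
        raise ValueError(f"'{s}' is not in the right format.")
    # Exactly one of the two area-code groups participates in a match
    area = match.group('area_paren') or match.group('area_plain')
    return (area, match.group('prefix'), match.group('suffix'))


accepted = ['(404) 555-1212', '(404) 5551212', '404-555-1212',
            '404-5551212', '404555-1212', '4045551212']

for s in accepted:
    # Pad with spaces to confirm that leading and trailing spaces are ignored
    assert parse_phone_sketch(f'  {s} ') == ('404', '555', '1212')

print("All six formats parse to ('404', '555', '1212').")
```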


<!-- Test Cell Boilerplate -->
The cell below will test your solution for Exercise 3. The testing variables will be available for debugging under the following names in a dictionary format.
- `input_vars` - Input variables for your solution. 
- `original_input_vars` - Copy of input variables from prior to running your solution. These _should_ be the same as `input_vars` - otherwise the inputs were modified by your solution.
- `returned_output_vars` - Outputs returned by your solution.
- `true_output_vars` - The expected output. This _should_ "match" `returned_output_vars` based on the question requirements - otherwise, your solution is not returning the correct output. 

In [173]:
### test_cell_ex3
from tester_fw.testers import Tester

conf = {
    'case_file':'tc_3', 
    'func': lambda s: eif_wrapper(s, parse_phone2), # replace this with the function defined above
    'inputs':{ # input config dict. keys are parameter names
        's':{
            'dtype':'str', # data type of param.
            'check_modified':False,
        }
    },
    'outputs':{
        'output_0':{
            'index':0,
            'dtype':'bool',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': True, # Ignored if dtype is not df
            'check_row_order': True, # Ignored if dtype is not df
            'check_column_type': True, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        },
        'output_1':{
            'index':1,
            'dtype':'NoneType',
            'check_dtype': True,
            'check_col_dtypes': True, # Ignored if dtype is not df
            'check_col_order': True, # Ignored if dtype is not df
            'check_row_order': True, # Ignored if dtype is not df
            'check_column_type': True, # Ignored if dtype is not df
            'float_tolerance': 10 ** (-6)
        },
        
    }
}
tester = Tester(conf, key=b'F24oYNyh8kq_wkZD_Oo0ZCPHLcoO-xNXHOYNiPnQmmY=', path='resource/asnlib/publicdata/')
for _ in range(70):
    try:
        tester.run_test()
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise

print('Passed! Please submit.')

Passed! Please submit.


**Fin!** This cell marks the end of Part 0. Don't forget to save, restart and rerun all cells, and submit it. When you are done, proceed to Parts 1 and 2.