## Regex Lab

If you use notebooks, you can open this file and work in it directly. If you prefer a REPL, you can copy and paste these short snippets into IPython as you go.

In [1]:

# To make it easy to test your regexes, I've provided a helper class
# that you can use to test your regexes against a set of strings that
# should match and a set of strings that should not match.

# Note: the test cases are not exhaustive, but meant to provide some useful sample
# cases.
import re

class RegexProblem:
    def __init__(self, matches, non_matches):
        self.matches = matches
        self.non_matches = non_matches
    
    def try_regex(self, regex_str):
        regex = re.compile(regex_str, re.VERBOSE)
        wrong = False
        for match in self.matches:
            if not regex.fullmatch(match):
                print(f"{match} should match but doesn't")
                wrong = True
        for non_match in self.non_matches:
            if regex.fullmatch(non_match):
                print(f"{non_match} should not match but does")
                wrong = True
        if not wrong:
            print("All tests passed!")


# example: match any string that contains only a's, b's, and/or c's
example_problem = RegexProblem(["a", "bb", "ccc", "abc", "aaaaaaaa"], ["d", "ace", "xyz"])


In [2]:
# you can experiment with it (in a REPL or in a notebook) like this:
example_problem.try_regex("[abc]")

bb should match but doesn't
ccc should match but doesn't
abc should match but doesn't
aaaaaaaa should match but doesn't


In [3]:
# oh right! [abc] only matches a single character, so it doesn't match "bb"
# let's try again:
example_problem.try_regex("[abc]+")

All tests passed!
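
The helper uses `fullmatch`, which is why `[abc]` failed on `bb`. Here is a minimal sketch of how `search`, `match`, and `fullmatch` differ on the same pattern. The sample strings are made up for illustration.

```python
import re

pattern = re.compile(r"[abc]")

# search succeeds if the pattern matches anywhere in the string
found_anywhere = pattern.search("xbz") is not None

# match succeeds only if the pattern matches at the start of the string
found_at_start = pattern.match("xbz") is not None

# fullmatch succeeds only if the pattern consumes the whole string
whole_single = pattern.fullmatch("b") is not None
whole_double = pattern.fullmatch("bb") is not None

print(found_anywhere, found_at_start, whole_single, whole_double)
# True False True False
```

Because `try_regex` calls `fullmatch`, you do not need `^` and `$` anchors in the patterns you pass to it.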


In [4]:
# For this lab, with each problem below, try finding a regex that satisfies try_regex!

## Problem 1

Let's build a regular expression that matches floating point numbers.  For this problem we'll go one piece at a time:

In [5]:
# first start with a regex that only matches integers
digits_only = RegexProblem(["1", "123", "456789", "094"], ["a", "4.5", "-75.412", "43a4"])

In [6]:
digits_only.try_regex(r"\d+")  # raw string so that \d reaches the regex engine unchanged

All tests passed!


In [7]:
# notice your first regex allowed numbers like 094; modify the regex to disallow a leading zero
integers_only = RegexProblem(["1", "123", "456789"], ["a", "4.5", "-75.412", "43a4", "094"])

In [8]:
integers_only.try_regex(r"[1-9]\d*")

All tests passed!


In [12]:
# now let's allow negative numbers too: a single (optional) leading dash
allow_negatives = RegexProblem(["1", "123", "456789", "-7", "-90"], ["a", "4.5", "-75.412", "43a4", "094", "-04"])

In [9]:
allow_negatives.try_regex(r"-?[1-9]\d*")

All tests passed!

In [10]:
# now allow a decimal part (a dot followed by digits); this entire part is optional
floats = RegexProblem(["1", "123", "456789", "-7", "-90", "4.5", "-75.412"], ["a", "43a4", "094", "-04"])
floats.try_regex(r"-?[1-9]\d*(\.\d+)?")

All tests passed!


In [20]:
# does your regex allow anything strange? 
# try modifying the test cases to see if you can find any bugs in your regex
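
One way to probe the final pattern is to run it over a handful of extra strings. The sketch below calls `fullmatch` directly rather than going through `RegexProblem`. The probe strings are made-up examples, not part of the lab's test cases.

```python
import re

float_pattern = re.compile(r"-?[1-9]\d*(\.\d+)?")

# hypothetical probe strings chosen to poke at the edges of the pattern
probes = ["0", "0.5", "-0.25", "10.0", "1.", ".5", "1e5"]

# map each probe to whether the whole string matches
results = {s: bool(float_pattern.fullmatch(s)) for s in probes}
for text, matched in results.items():
    print(f"{text!r:>8} -> {'match' if matched else 'no match'}")
```

Of these probes, only `10.0` matches. Because the integer part must start with a digit from 1 to 9, the pattern rejects `0` and `0.5`. Whether that counts as a bug depends on the rules you want to enforce.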

## Problem 2

Let's build something that matches URLs.

As we saw before, URLs take the form:

`protocol :// domain [:port] / [path]`

This time, you can test your own regexes as you go to build a regex that matches this entire pattern.

For our purposes we'll define each part as follows:

* protocol - 2-8 letters followed by `://` (e.g. "ws://", "https://")
* domain - a mixture of letters and numbers optionally separated by dots '.'  (e.g. "example.com", "101domain.com", "localhost", "cs.uchicago.edu") Don't worry about invalid endings, etc. 
* port - a colon followed by a number (e.g. :8000, :443)
* path - must begin with slash, then same rules as domain but also allow forward slashes  (e.g. "/", "/index", "/index.html", "/a/123")

In [15]:
# feel free to practice with these, or go straight to the URL test cases
protocol = RegexProblem(["ws://", "http://", "https://"], ["ws:", "ws//", "x://", "http:/", "99://"])
domain = RegexProblem(["example.com", "101domain.com", "localhost", "cs.uchicago.edu", "a.b.c.biz", "127.0.0.1"],
                      ["no spaces.com", "bad?com", ""])
# if you want to write your own test cases for port and path you can do so
protocol.try_regex(r"[a-z]{2,8}://")
domain.try_regex(r"[\w\.]+")

All tests passed!
All tests passed!


In [16]:
url = RegexProblem(
    ["http://example.com", "http://localhost:8000", "ftp://127.0.0.1", 
     "http://example.com:80/index.html", "ws://a.b.c.edu/fruit/apple/123"],
    ["http://", "example.com", "http://:0", "http://localhost:8000:8000", "http://test?biz",
     "http://example.com/art/238$!!@", "https://e$.com"]
)
# this list is far from exhaustive, but should provide a decent starting point
url.try_regex(r"[a-z]{2,8}://[\w\.]+(:\d+)?(/[\w\.\-]+)*")

All tests passed!
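
Since `try_regex` compiles with `re.VERBOSE`, a long pattern like this can be spread over several lines with a comment per part. The sketch below is one possible layout, not the only correct answer. Its path rule is an assumption that differs slightly from the one-line pattern above: it also accepts a bare `/`, which the rules list as a valid path.

```python
import re

# with re.VERBOSE, whitespace and "#" comments inside the pattern are ignored
url_pattern = re.compile(
    r"""
    [a-z]{2,8}://      # protocol: 2-8 lowercase letters, then ://
    [\w.]+             # domain: word characters and dots
    (:\d+)?            # optional port: colon followed by digits
    (/[\w./-]*)?       # optional path: also accepts a bare slash
    """,
    re.VERBOSE,
)

# a few strings taken from the test cases, plus a bare-slash URL
checks = {
    url: bool(url_pattern.fullmatch(url))
    for url in [
        "http://example.com/",
        "ws://a.b.c.edu/fruit/apple/123",
        "http://localhost:8000:8000",
        "http://example.com/art/238$!!@",
    ]
}
print(checks)
```

The first two strings match and the last two do not. The same multi-line string can be passed to `url.try_regex` unchanged.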



### Note
In practice, use `urllib.parse.urlparse` to parse URLs instead of building your own regex. The rules given here are simplified from the actual URL syntax.

https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urlparse
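
For comparison, here is a short sketch of what the standard library gives you. The URL is a made-up example.

```python
from urllib.parse import urlparse

parts = urlparse("http://example.com:80/index.html?q=hello")

print(parts.scheme)    # http
print(parts.hostname)  # example.com
print(parts.port)      # 80
print(parts.path)      # /index.html
print(parts.query)     # q=hello
```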

## Problem 3

Remember that a URL can end with a query string.

A query string looks like `?var=val&var2=other_val&var3=123` and comes at the end of the URL.

This is a good fit for `findall`.  We won't use `RegexProblem` anymore, but instead use `re.findall` to break up these query strings into their components. 

To keep things simple, let's use these rules:

Each segment of a query string is separated from the others by an `&` and consists of a `name` and `value`.

A `name` must consist of letters, numbers, and underscores.

A `value` must consist of letters, numbers, and any punctuation other than `&`.

Use `re.findall` to extract these patterns.  (Hint: write your pattern to match *one* name=value pair)


In [20]:
query_strings = [
    "?q=hello",
    "?q=hello&lang=en",
    "?q=plus+is+used+for+space+in+queries",
    "?a=1&b=2&c=3&d=4",
    "?first_name=Paul&last_name=Rust",
]

In [21]:
# write your regex here
qs_regex = re.compile(r"(\w+=[\w\+]+)")

In [22]:
for query_string in query_strings:
    print(qs_regex.findall(query_string))

['q=hello']
['q=hello', 'lang=en']
['q=plus+is+used+for+space+in+queries']
['a=1', 'b=2', 'c=3', 'd=4']
['first_name=Paul', 'last_name=Rust']


Your output should look like:
```
['q=hello']
['q=hello', 'lang=en']
['q=plus+is+used+for+space+in+queries']
['a=1', 'b=2', 'c=3', 'd=4']
['first_name=Paul', 'last_name=Rust']
```

**Additional challenge:** Try using groups to capture the key and value separately so you instead get:

```
[('q', 'hello')]
[('q', 'hello'), ('lang', 'en')]
[('q', 'plus+is+used+for+space+in+queries')]
[('a', '1'), ('b', '2'), ('c', '3'), ('d', '4')]
[('first_name', 'Paul'), ('last_name', 'Rust')]
```

In [23]:
qs_regex_improved = re.compile(r"(\w+)=([\w\+]+)")
for query_string in query_strings:
    print(qs_regex_improved.findall(query_string))

[('q', 'hello')]
[('q', 'hello'), ('lang', 'en')]
[('q', 'plus+is+used+for+space+in+queries')]
[('a', '1'), ('b', '2'), ('c', '3'), ('d', '4')]
[('first_name', 'Paul'), ('last_name', 'Rust')]
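
The value class `[\w\+]+` covers the sample strings but stops at other punctuation such as `.` or `/`. A sketch of one looser alternative follows, assuming a value is any run of characters up to the next `&`. It is shown next to the standard library's `parse_qsl` for comparison. The query string is made up.

```python
import re
from urllib.parse import parse_qsl

# assumption: a value is any run of characters up to the next "&"
pair_regex = re.compile(r"(\w+)=([^&]+)")

sample = "?q=plus+is+used&path=a.b/c-d"  # made-up query string

regex_pairs = pair_regex.findall(sample)
# parse_qsl expects the query without the leading "?" and decodes "+" as a space
stdlib_pairs = parse_qsl(sample.lstrip("?"))

print(regex_pairs)   # [('q', 'plus+is+used'), ('path', 'a.b/c-d')]
print(stdlib_pairs)  # [('q', 'plus is used'), ('path', 'a.b/c-d')]
```

As with `urlparse`, prefer `parse_qsl` or `parse_qs` outside of a regex exercise, since they also handle percent-encoding.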


## Problem 4

We want to count how many times the word "mother" appears in shakespeare.txt.

In [25]:
with open("shakespeare.txt") as f:
    corpus = f.read()

corpus.count("mother")

In [41]:
corpus.count("mother")   # it is tempting to use str.count

441

In [42]:
corpus.count("Mother")   # we want to ignore case too (which we could do with lower() in theory)

35

In [47]:
corpus.count("smothered")  # but there are other words that contain "mother"

2

In [44]:
corpus.count(" mother ") # it is tempting to try something like this

143

In [27]:
corpus.count(" mother,") # but there are a lot of variations to account for

80

In [28]:
# try to find a regex to find all instances of the word "mother" regardless of case
# but excluding other words
# Hint: Take a look at flags and anchors in the notes.

In [31]:
len(re.findall(r"\bmother\b", corpus, re.IGNORECASE))

437

In [None]:
# you can call `len` on this to get the total count -- my count was 437
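
To see what `\b` and `re.IGNORECASE` each contribute without the full corpus, here is a small sketch on a made-up sentence standing in for `shakespeare.txt`:

```python
import re

# made-up sample text standing in for the real corpus
sample = "Mother, my mother smothered the mothers. MOTHER!"

# case-insensitive substring count also picks up "smothered" and "mothers"
substring_count = sample.lower().count("mother")

# \b requires a word boundary on both sides, so only the standalone word matches
word_matches = re.findall(r"\bmother\b", sample, re.IGNORECASE)

print(substring_count)  # 5
print(word_matches)     # ['Mother', 'mother', 'MOTHER']
```

Punctuation such as `,` and `!` counts as a boundary, so there is no need to list every variation by hand.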