In [None]:
from lec_utils import *

<div class="alert alert-info" markdown="1">

#### Lecture 11

# Regular Expressions

### EECS 398-003: Practical Data Science, Fall 2024

<small><a style="text-decoration: none" href="https://practicaldsc.org">practicaldsc.org</a> • <a style="text-decoration: none" href="https://github.com/practicaldsc/fa24">github.com/practicaldsc/fa24</a></small>
    
</div>

### Announcements 📣

- Homework 5 is due on **Thursday**. It includes a required [**Pre-Midterm Survey**](https://docs.google.com/forms/d/e/1FAIpQLSfCT2TfFUWF0gbnfuV_at0bG3w0Za9-KuLIA7cpZm0NL5jbKQ/viewform).<br><small>Homework 6 will be released later this week, but won't be due until after Fall Break.</small>
- The Midterm Exam is on **Wednesday, October 9th from 7-9PM**.
    - Lectures 1-12 and Homeworks 1-6 are in scope.
    - The lecture before the exam will be review, and the TAs will run a review session on **Monday from 6-8PM in FXB 1109** too.
    - You can bring **one double-sided 8.5"x11" notes sheet that you handwrite yourself (no printing, no using an iPad, etc.)**.
    - Work through old exam problems [**here**](https://study.practicaldsc.org/).
- Looking for sources of data, or other supplemental resources? Look at our updated [**Resources**](https://practicaldsc.org/resources) page!

### Aside: Spreadsheets

- We recorded a walkthrough video, [**linked here**](https://www.loom.com/share/eb06b185428542c391f21e55480a0d2d?sid=21ac597b-b8e5-4b3b-8c09-a3a3d7d5a219), on the spreadsheets part of Lecture 10 that we didn't get to cover last Thursday.<br><small>We won't have time to cover it now, either.</small>

In [None]:
from IPython.display import IFrame
IFrame(src='https://www.loom.com/embed/eb06b185428542c391f21e55480a0d2d?sid=3891cb7f-a4c9-4a34-8211-0347a283d413',
       width=400, height=300)

- The spreadsheet I created in the video can be found [**here**](https://docs.google.com/spreadsheets/d/15RspFWbO_x7PHOJHc-DuyhTrQOlJ66nIKRBIkHBfcyM/edit?usp=sharing).

- Spreadsheets won't be in an assignment or on an exam, but they will be useful at some point in your life, so now is as good a time as any to follow along and learn!<br><small>Let me know if you have any feedback or comments on it.</small>

### Agenda

Today's lecture will mostly be about **regular expressions**. Good resources:
- [regex101.com](https://regex101.com), a helpful site to have open while writing regular expressions.
- Python [`re` library documentation](https://docs.python.org/3/library/re.html) and [how-to](https://docs.python.org/3/howto/regex.html).<br><small>The "how-to" is great, read it!</small>
- [regex "cheat sheet"](https://practicaldsc.org/resources/other/berkeley-regex-reference.pdf).
- These are all on the [**resources tab of the course website**](https://practicaldsc.org/resources) as well.

## Motivation

---

In [None]:
email = '''
Thank you for buying our expensive product!
If you have a complaint, please send it to complaints@compuserve.com or call (800) 867-5309.
If you are happy with your purchase, please call us at (800) 123-4567; we'd love to hear from you!
Due to high demand, please allow one-hundred (100) business days for a response.
'''

### Who called? 📞

- **Goal**: Extract all phone numbers from a piece of text, **assuming** they are of the form `'(###) ###-####'`.

In [None]:
print(email)

- We can do this using the same string methods we've come to know and love.

- Strategy:
    - Split by spaces.
    - Check if there are any consecutive "words" where:
        - the first "word" looks like an area code, like `'(678)'`.
        - the second "word" looks like the last 7 digits of a phone number, like `'999-8212'`. 

### Checking formatting

- Let's first implement a function that takes in a string and returns whether it looks like an area code.

In [None]:
def is_possibly_area_code(s):
    '''Does `s` look like (678)?'''
    return (len(s) == 5 and
            s.startswith('(') and
            s.endswith(')') and
            s[1:4].isnumeric())

In [None]:
is_possibly_area_code('(123)')

In [None]:
is_possibly_area_code('(99)')

- Let's also implement a function that takes in a string and returns whether it looks like the last 7 digits of a phone number.

In [None]:
def is_last_7_phone_number(s):
    '''Does `s` look like 999-8212?'''
    return len(s) == 8 and s[0:3].isnumeric() and s[3] == '-' and s[4:].isnumeric()

In [None]:
is_last_7_phone_number('999-8212')

In [None]:
is_last_7_phone_number('534 1100')

- Finally, let's split the entire text by spaces, and check whether there are any instances where `pieces[i]` looks like an area code and `pieces[i+1]` looks like the last 7 digits of a phone number.

In [None]:
print(email)

In [None]:
# Removes punctuation from the end of each string.
pieces = [s.rstrip('.,?;"\'') for s in email.split()]
for i in range(len(pieces) - 1):
    if is_possibly_area_code(pieces[i]):
        if is_last_7_phone_number(pieces[i+1]):
            print(pieces[i], pieces[i+1])

### Is there a better way?

- This was an example of **pattern matching**.

- Pattern matching can be done with string methods, but there is often a better approach: **regular expressions**.

In [None]:
print(email)

In [None]:
import re
re.findall(r'\(\d{3}\) \d{3}-\d{4}', email)

<center><h3>🤯</h3></center>

## Basic regular expressions

---

### Regular expressions

- A regular expression, or **regex** for short, is a sequence of characters used to **match patterns in strings**.

- For example, `\(\d{3}\) \d{3}-\d{4}` describes a **pattern** that matches US phone numbers of the form `'(XXX) XXX-XXXX'`.

- Think of regex as a "mini-language".<br><small>Formally, they are a **grammar** for describing a language.</small>

- **Pros ✅**: They are very powerful and are widely used – virtually every programming language has a module for working with them.

- **Cons ❌**: They can be hard to read and have many different "dialects."

### Writing regular expressions

- You will ultimately write most of your regular expressions in Python, using the `re` module. We will see how to do so shortly.

- However, a useful tool for designing regular expressions is [**regex101.com**](https://regex101.com).

- We will use it heavily during lecture; you should have it open as we work through examples. **If you're trying to revisit this lecture in the future, you'll likely want to watch the recording; just looking at the notebook won't give you enough context.**

### Literals

- A literal is a character that has no special meaning.

- Letters, numbers, and some symbols are all literals.

- Some symbols, like `.`, `*`, `(`, and `)`, are special characters.

- ***Example***: The regex `hey` matches the string `'hey'`. The regex `he.` also matches the string `'hey'`.

### Regex building blocks 🧱

The four main building blocks for all regexes are shown below.<br><small><a href="https://www.cs.princeton.edu/courses/archive/spring17/cos226/lectures/54RegularExpressions.pdf">table source</a>, <a href="https://docs.google.com/presentation/d/1xQsqa7e3xDZ9nBiekbSBOecwvQm8pSVGa-FBoV6aJ7E/edit#slide=id.g11197671c7e_0_919">inspiration</a>.</small>

| operation | order of op. | example | matches ✅ | does not match ❌ |
|:--- |:---|:---|:---|:---|
| <span style='color:purple'><b>concatenation</b></span> | 3 | `AABAAB` | `'AABAAB'` | every other string |
| <span style='color:purple'><b>or</b></span> | 4 | `AA\|BAAB` | `'AA'`, `'BAAB'` | every other string |
| <span style='color:purple'><b>closure</b><br>(zero or more)</span> | 2 | `AB*A` | `'AA'`, `'ABBBBBBA'` | `'AB'`, `'ABABA'` |
| <span style='color:purple'><b>parentheses</b></span> | 1 | `A(A\|B)AAB` <hr style="height:1px"> `(AB)*A` | `'AAAAB'`, `'ABAAB'`<hr style="height:1px">`'A'`, `'ABABABABA'` | every other string<hr style="height:1px">`'AA'`, `'ABBA'` |

Note that `|`, `(`, `)`, and `*` are **special characters**, not literals. They manipulate the characters around them.
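
The table above treats each regex as describing **entire** strings. As a minimal sketch, the rows can be checked in Python with `re.fullmatch`, which only succeeds when the whole string fits the pattern. The helper `matches`, the `table` list, and the extra "does not match" strings for the "every other string" rows are made up for illustration.

```python
import re

def matches(pattern, s):
    '''Hypothetical helper: does `pattern` match the ENTIRE string `s`?'''
    return re.fullmatch(pattern, s) is not None

# Each entry: (regex, strings that should match, strings that should not).
table = [
    (r'AABAAB',    ['AABAAB'],         ['AABAA', 'AAB']),
    (r'AA|BAAB',   ['AA', 'BAAB'],     ['AAB', 'AABAAB']),
    (r'AB*A',      ['AA', 'ABBBBBBA'], ['AB', 'ABABA']),
    (r'A(A|B)AAB', ['AAAAB', 'ABAAB'], ['AAAB', 'ACAAB']),
    (r'(AB)*A',    ['A', 'ABABABABA'], ['AA', 'ABBA']),
]

# For each regex, record which "should match" and "should not match" strings match.
results = {
    pattern: ([matches(pattern, s) for s in good],
              [matches(pattern, s) for s in bad])
    for pattern, good, bad in table
}

for pattern, (good_flags, bad_flags) in results.items():
    print(pattern, good_flags, bad_flags)
```

Every flag in the first list of each row should be `True`, and every flag in the second list should be `False`.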

***Example (or, parentheses)***:
- What does `EECS 280|398` match?
- What does `EECS (280|398)` match?

***Example (closure, parentheses)***:
- What does `eecs*` match?
- What does `(eecs)*` match?
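
Once you've tried the two examples above on regex101, one way to double-check your answers is the sketch below. It again uses `re.fullmatch` to ask which of a few candidate strings each pattern matches in full; the helper `full_matches` and the candidate lists are hypothetical.

```python
import re

def full_matches(pattern, candidates):
    '''Hypothetical helper: the candidates that `pattern` matches in their entirety.'''
    return [s for s in candidates if re.fullmatch(pattern, s)]

courses = ['EECS 280', 'EECS 398', '280', '398']

# "or" has the lowest precedence, so this reads as (EECS 280) or (398).
no_parens = full_matches(r'EECS 280|398', courses)

# Parentheses limit the "or" to the course numbers.
with_parens = full_matches(r'EECS (280|398)', courses)

words = ['eec', 'eecs', 'eecsss', 'eecseecs', '']

# Closure binds tighter than concatenation, so * applies only to the final 's'.
star_on_s = full_matches(r'eecs*', words)

# Parentheses make * apply to the whole group 'eecs'.
star_on_group = full_matches(r'(eecs)*', words)

print(no_parens, with_parens)
print(star_on_s, star_on_group)
```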

<div class="alert alert-success" markdown="1">
    <h3>Activity</h3>

Write a regular expression that matches `'billy'`, `'billlly'`, `'billlllly'`, etc.
- First, think about how to match strings with any even number of `'l'`s, including zero `'l'`s (i.e. `'biy'`).
- Then, think about how to match only strings with a **positive even** number of `'l'`s.

</div>

<br>

<details>
<summary>
    ✅ Click here to see the answer <b>after</b> you've tried it yourself at <a href='https://regex101.com'>regex101.com</a>.
</summary>
<code>bi(ll)*y</code> will match any even number of <code>'l'</code>s, including 0.
    
To match only a positive even number of <code>'l'</code>s, we'd need to first "fix into place" two <code>'l'</code>s, and then follow that up with zero or more pairs of <code>'l'</code>s. This specifies the regular expression <code>bill(ll)*y</code>.
    </details>

<div class="alert alert-success" markdown="1">
    <h3>Activity</h3>

Write a regular expression that matches `'billy'`, `'billlly'`, `'biggy'`, `'biggggy'`, etc.

<br>

Specifically, it should match any string with a **positive even** number of `'l'`s in the middle, or a **positive even** number of `'g'`s in the middle.

</div>

<br>

<details>
<summary>
    ✅ Click here to see the answer <b>after</b> you've tried it yourself at <a href='https://regex101.com'>regex101.com</a>.
</summary>

Possible answers: <code>bi(ll(ll)\*|gg(gg)\*)y</code> or <code>bill(ll)\*y|bigg(gg)\*y</code>.
 
<br>

Note, <code>bill(ll)\*|gg(gg)\*y</code> is <b>not</b> a valid answer! This is because "concatenation" comes before "or" in the order of operations. This regular expression would match strings that match <code>bill(ll)\*</code>, like <code>'billll'</code>, OR strings that match <code>gg(gg)\*y</code>, like <code>'ggy'</code>.

    
</details>

## Intermediate regex

---

### More regex syntax

| operation | example | matches ✅ | does not match ❌ |
|:--- |:---|:---|:---|
| <span style='color:purple'><b>wildcard</b></span> | `.U.U.U.` | `'CUMULUS'`<br>`'JUGULUM'` | `'SUCCUBUS'`<br>`'TUMULTUOUS'` |
| <span style='color:purple'><b>character class</b></span>  | `[A-Za-z][a-z]*` | `'word'`<br>`'Capitalized'` | `'camelCase'`<br>`'4illegal'` |
| <span style='color:purple'><b>at least one</b></span> | `bi(ll)+y` | `'billy'`<br>`'billlllly'` | `'biy'`<br>`'bily'` |
| <span style='color:purple'><b>between $i$ and $j$ occurrences</b></span> | `m[aeiou]{1,2}m` | `'mem'`<br>`'maam'`<br>`'miem'` | `'mm'`<br>`'mooom'`<br>`'meme'` |

`.`, `[`, `]`, `+`, `{`, and `}` are also special characters, in addition to `|`, `(`, `)`, and `*`.

***Example (character classes, at least one)***: `[A-E]+` is just shortform for `(A|B|C|D|E)(A|B|C|D|E)*`.
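
A quick sketch of that equivalence, using a handful of made-up sample strings: both patterns should accept and reject exactly the same ones.

```python
import re

short_form = r'[A-E]+'
long_form = r'(A|B|C|D|E)(A|B|C|D|E)*'

# Hypothetical test strings: some made only of A through E, some not.
samples = ['A', 'BEAD', 'ABCDE', '', 'F', 'ABF', 'abc']

short_ok = [s for s in samples if re.fullmatch(short_form, s)]
long_ok = [s for s in samples if re.fullmatch(long_form, s)]

print(short_ok)
print(long_ok)
```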

***Example (wildcard)***: 
- What does `.` match? 
- What does `he.` match? 
- What does `...` match?

***Example (at least one, closure)***: 
- What does `123+` match?
- What does `123*` match?

***Example (number of occurrences)***: What does `tri{3, 5}` match? Does it match `'triiiii'`?

***Example (character classes, number of occurrences)***:
What does `[1-6a-f]{3}-[7-9E-S]{2}` match?
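
After trying these on regex101, a sketch like the one below can confirm some of the answers in Python. The helper `full_matches` and all of the candidate strings are made up for illustration, and the `tri{3, 5}` question is deliberately left for you to try.

```python
import re

def full_matches(pattern, candidates):
    '''Hypothetical helper: the candidates that `pattern` matches in their entirety.'''
    return [s for s in candidates if re.fullmatch(pattern, s)]

# Wildcard: `.` stands for any single character except a newline.
wildcard = full_matches(r'he.', ['hey', 'he!', 'he', 'heyy', 'he\n'])

# `+` needs at least one '3'; `*` also allows zero. Both apply only to the '3'.
numbers = ['12', '123', '12333', '123123']
at_least_one = full_matches(r'123+', numbers)
zero_or_more = full_matches(r'123*', numbers)

# Character classes combined with an exact number of occurrences.
codes = full_matches(r'[1-6a-f]{3}-[7-9E-S]{2}',
                     ['1af-9G', '7af-9G', '1af-9g', '1a-9G'])

print(wildcard, at_least_one, zero_or_more, codes)
```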

<div class="alert alert-success" markdown="1">
    <h3>Activity</h3>

Write a regular expression that matches any lowercase string that has a repeated vowel, such as `'noon'`, `'peel'`, `'festoon'`, or `'zeebraa'`.

</div>

<br>

<details>
<summary>
    ✅ Click here to see the answer <b>after</b> you've tried it yourself at <a href='https://regex101.com'>regex101.com</a>.
</summary>

One answer: <code>[a-z]\*(aa|ee|ii|oo|uu)[a-z]\*</code>
 
<br>

This regular expression matches strings of lowercase characters that have <code>'aa'</code>, <code>'ee'</code>, <code>'ii'</code>, <code>'oo'</code>, or <code>'uu'</code> in them anywhere. <code>[a-z]\*</code> means "zero or more of any lowercase characters"; essentially we are saying it doesn't matter what letters come before or after the double vowels, as long as the double vowels exist somewhere.

    
</details>

<div class="alert alert-success" markdown="1">
    <h3>Activity</h3>

Write a regular expression that matches any string that contains **both** a lowercase letter and a number, in any order. Examples include `'billy398'`, `'398!!billy'`, and `'bil3ly98'`.

</div>

<br>

<details>
<summary>
    ✅ Click here to see the answer <b>after</b> you've tried it yourself at <a href='https://regex101.com'>regex101.com</a>.
</summary>

One answer: <code>(.\*[a-z].\*[0-9].\*)|(.\*[0-9].\*[a-z].\*)</code>
 
<br>

We can break the above regex into two parts – everything before the `|`, and everything after the `|`.

The first part, <code>.\*[a-z].\*[0-9].\*</code>, matches strings in which there is at least one lowercase character and at least one digit, with the lowercase character coming first.

The second part, <code>.\*[0-9].\*[a-z].\*</code>, matches strings in which there is at least one lowercase character and at least one digit, with the digit coming first.
    
Note, the <code>.\*</code> between the digit and letter classes is needed in the event the string has non-digit and non-letter characters.
    
<b>This is the kind of task that would be easier to accomplish with regular Python string methods.</b>

    
</details>

### Even more regex syntax

| operation | example | matches ✅ | does not match ❌ |
|:--- |:---|:---|:---|
| <span style='color:purple'><b>escape character</b></span> | `umich\.edu` | `'umich.edu'` | `'umich!edu'` |
| <span style='color:purple'><b>beginning of line</b></span> | `^ark` | `'ark two'`<br>`'ark o ark'` | `'dark'` |
| <span style='color:purple'><b>end of line</b></span>  | `ark$` | `'dark'`<br>`'ark o ark'` | `'ark two'` |
| <span style='color:purple'><b>zero or one</b></span> | `cat?` | `'ca'`<br>`'cat'` | `'cart'` (matches `'ca'` only) |
| <span style='color:purple'><b>built-in character classes*</b></span> | `\w+` <br> `\d+` | `'billy'`<br>`'231231'` | `'this person'`<br>`'858 people'` |
| <span style='color:purple'><b>character class negation</b></span> | `[^a-z]+` | `'WOLVERINE551'`<br>`'1721$$'` | `'porch'`<br>`'billy.edu'` |

****Note***: in Python's implementation of regex,
- `\d` refers to digits.
- `\w` refers to alphanumeric characters and underscores (`[A-Za-z0-9_]`). **Whenever we say "alphanumeric" in an assignment, we're referring to `\w`!**
- `\s` refers to whitespace.
- `\b` is a word boundary.
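
As a small sketch of these built-in classes, here is what each one picks out of a made-up sentence. On plain ASCII text, `\w` behaves like the single character class `[A-Za-z0-9_]`; by default, Python 3 also lets `\w` match letters and digits from other alphabets.

```python
import re

# A made-up sentence, purely for illustration.
text = 'Lecture_11 starts at 3:00 PM!'

digits = re.findall(r'\d+', text)   # Runs of digits.
words = re.findall(r'\w+', text)    # Runs of letters, digits, and underscores.
spaces = re.findall(r'\s', text)    # Individual whitespace characters.

# On ASCII text, spelling out the class by hand finds the same runs as \w+.
explicit = re.findall(r'[A-Za-z0-9_]+', text)

print(digits)
print(words)
print(len(spaces))
```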

***Example (escaping)***: 
- What does `he.` match? 
- What does `he\.` match? 
- What does `(734)` match? 
- What does `\(734\)` match?

***Example (anchors)***: 
- What does `734-764` match?
- What does `^734-764` match?
- What does `734-764$` match?
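
A sketch of the anchor examples, using `re.search` (which looks for the pattern anywhere in the string) on three made-up strings. Note that without extra flags, Python's `^` and `$` anchor to the start and end of the whole string rather than to each line within it.

```python
import re

# Made-up strings containing '734-764' in different positions.
samples = ['734-764-1817', 'call 734-764-1817', 'ext. 734-764']

anywhere = [s for s in samples if re.search(r'734-764', s)]
at_start = [s for s in samples if re.search(r'^734-764', s)]
at_end = [s for s in samples if re.search(r'734-764$', s)]

print(anywhere)
print(at_start)
print(at_end)
```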

***Example (built-in character classes)***:

- What does `\d{3} \d{3}-\d{4}` match?
- What does `\bcat\b` match? Does it find a match in `'my cat is hungry'`? What about `'concatenate'`, `'kitty cat'`, or `'in-the-cat-hat'`?
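
To check your answers to the questions above, here is a minimal sketch using `re.findall`. The phone-number string is made up; the four `'cat'` strings come from the question itself.

```python
import re

# \d{3} \d{3}-\d{4} needs three digits, a space, three digits, a dash, four digits.
phones = re.findall(r'\d{3} \d{3}-\d{4}', 'call 800 867-5309 or (800) 123-4567')

# \b sits between a \w character and a non-\w character (or the edge of the string).
samples = ['my cat is hungry', 'concatenate', 'kitty cat', 'in-the-cat-hat']
found = {s: re.findall(r'\bcat\b', s) for s in samples}

print(phones)
for s, hits in found.items():
    print(repr(s), hits)
```

Hyphens are not `\w` characters, so there is a word boundary on both sides of `'cat'` in `'in-the-cat-hat'`.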

<br><br>

Remember, in Python's implementation of regex,
- `\d` refers to digits.
- `\w` refers to alphanumeric characters and underscores (`[A-Za-z0-9_]`). **Whenever we say "alphanumeric" in an assignment, we're referring to `\w`!**
- `\s` refers to whitespace.
- `\b` is a word boundary.

<div class="alert alert-success" markdown="1">
    <h3>Activity</h3>

Write a regular expression that matches any string that:
- is between 5 and 10 characters long, and
- is made up of only vowels (either uppercase or lowercase, including `'Y'` and `'y'`), periods, and spaces.

Examples include `'yoo.ee.IOU'` and `'AI.I oey'`.

</div>

<br>

<details>
<summary>
    ✅ Click here to see the answer <b>after</b> you've tried it yourself at <a href='https://regex101.com'>regex101.com</a>.
</summary>

One answer: <code>^[aeiouyAEIOUY. ]{5,10}$</code>
 
<br>

<b>Key idea</b>: Within a character class (i.e. <code>[...]</code>), special characters do not generally need to be escaped.


    
</details>
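
A short sketch of how special characters behave inside a character class, on a made-up string. The main exceptions are `]`, `\`, `^` at the very start of the class, and `-` between two characters, which still have special meaning there.

```python
import re

text = 'a.b*c+d?e'

# Outside a character class, `.` is a wildcard, so it matches every character here.
outside = re.findall(r'.', text)

# Inside a character class, `.`, `*`, `+`, and `?` are treated as literal characters.
inside = re.findall(r'[.*+?]', text)

print(outside)
print(inside)
```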

<div class="alert alert-success" markdown="1">
    <h3>Activity</h3><br><small>This is an old exam question!</small>
    
```
^\w{2,5}.\d*\/[^A-Z5]{1,}
```

Select all strings below that contain any match with the regular expression above.

- `"billy4/Za"`
- `"billy4/za"`
- `"DAI_s2154/pacific"`
- `"daisy/ZZZZZ"`
- `"bi_/_lly98"`
- `"!@__!14/atlantic"`

</div>

## Regex in Python

---

### `re` in Python

- The `re` module is built into Python. It allows us to use regular expressions to find, extract, and replace strings.

In [None]:
import re

- `re.findall` takes in a string `regex` and a string `text` and returns a list of all matches of `regex` in `text`. **You'll use this most often.**

In [None]:
re.findall('AB*A', 
           'here is a string for you: ABBBA. here is another: ABBBBBBBA')

- `re.sub` takes in a string `regex`, a string `repl`, and a string `text`, and replaces all matches of `regex` in `text` with `repl`.

In [None]:
re.sub('AB*A', 
       'billy', 
       'here is a string for you: ABBBA. here is another: ABBBBBBBA')

### Raw strings

When using regular expressions in Python, it's a good idea to use **raw strings**, denoted by an `r` before the quotes, e.g. `r'exp'`.

In [None]:
re.findall('\bcat\b', 'my cat is hungry')

In [None]:
re.findall(r'\bcat\b', 'my cat is hungry')

In [None]:
# Huh?
print('\bcat\b')
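
A sketch of what is going on: in a regular Python string, `'\b'` is a single backspace character, so the regex engine never sees the `\b` we intended. A raw string keeps the backslash and the `b` as two separate characters.

```python
import re

plain = '\bcat\b'    # Python turns each \b into one backspace character.
raw = r'\bcat\b'     # The backslashes survive and reach the regex engine.

print(len(plain), len(raw))
print(plain[0] == '\x08')   # '\x08' is the backspace character.

no_match = re.findall(plain, 'my cat is hungry')
match = re.findall(raw, 'my cat is hungry')
print(no_match, match)

# Doubling each backslash in a regular string is equivalent, but harder to read.
print('\\bcat\\b' == raw)
```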

### Capture groups

- Surround a regex with `(` and `)` to define a **capture group** within a pattern. Capture groups are useful for extracting relevant parts of a string.

In [None]:
re.findall(r'\w+@(\w+)\.edu', 
           'my old email was billy@notumich.edu, my new email is notbilly@umich.edu')

- Notice what happens if we remove the `(` and `)`!

In [None]:
re.findall(r'\w+@\w+\.edu', 
           'my old email was billy@notumich.edu, my new email is notbilly@umich.edu')

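- Earlier, we also saw that parentheses can be used to group parts of a regex together. In Python, every plain `(...)` group is also a capture group, so when a pattern contains groups, `re.findall` returns only the captured parts (as tuples, if there are several groups).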

In [None]:
# A regex that matches strings with two of the same vowel followed by 3 digits.
# We only want to capture the digits, but...
re.findall(r'(aa|ee|ii|oo|uu)(\d{3})', 'eeoo124')
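
One way around this, not covered elsewhere in this lecture, is a **non-capturing group**, written `(?:...)` in Python's `re` module. It groups just like parentheses do, but `re.findall` does not report it. A minimal sketch on the same string:

```python
import re

text = 'eeoo124'

# Two capture groups: findall returns a tuple per match.
both = re.findall(r'(aa|ee|ii|oo|uu)(\d{3})', text)

# (?:...) groups the vowels without capturing them, so only the digits come back.
digits_only = re.findall(r'(?:aa|ee|ii|oo|uu)(\d{3})', text)

# With no capture groups at all, findall returns each full match.
whole = re.findall(r'(?:aa|ee|ii|oo|uu)\d{3}', text)

print(both, digits_only, whole)
```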

### Example: Extracting hashtags

- The dataset `'data/ira.csv'` contains tweets tagged by Twitter as likely being posted by the [Internet Research Agency](https://en.wikipedia.org/wiki/Internet_Research_Agency), the tweet factory facing allegations of attempting to influence US political elections.<br><small>For more context, read [this Wikipedia article](https://en.wikipedia.org/wiki/Russian_interference_in_the_2016_United_States_elections).</small>

In [None]:
tweets = pd.read_csv('data/ira.csv', names=['id', 'user', 'time', 'text'])
tweets.head()

In [None]:
tweets.shape

- **Question**: What are the most common hashtags among all 9000 tweets?<br><small>A hashtag is any **alphanumeric string** beginning with `'#'`, e.g. `'#GoBlue'`.</small>

### Extracting hashtags

- We can use `re.findall` to find all of the hashtags in a particular string.

In [None]:
example_tweet = tweets['text'].iloc[0]
example_tweet

In [None]:
re.findall(r'#(\w+)', example_tweet) 

In [None]:
re.findall(r'#(\w+)', 'hey there, no hashtags here') 

- We can use the Series `str.findall` method, with the regular expression above, to extract hashtags out of each tweet in `tweets['text']`.

In [None]:
tags = tweets['text'].str.findall(r'#(\w+)') 
tags.head()

- We can use the `sum()` method on the above Series to concatenate all of these lists into a large list!

In [None]:
(
    pd.Series(tags.sum())
    .value_counts()
    .head(15)
    .sort_values()
    .plot(kind='barh', title='Most Common Hashtags in IRA Tweets')
)

### Followup questions

- Which accounts were **tagged** most often?<br><small>For example, in the tweet `'I love being a @UMich student'`, user `'UMich'` is tagged.</small>

- Which accounts tweeted most often?

- Which websites were **linked** most often?

- **Why** were these hashtags used by these accounts?<br><small>Again, read the linked Wikipedia article, and do a bit of your own research! These tweets **aren't** by a random sample of Twitter users.</small>
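
As a starting point for the first followup question, here is a minimal sketch that counts tagged accounts with the same capture-group idea used for hashtags. The tweets in `sample_tweets` are made up (they are not from `'data/ira.csv'`), and the sketch uses `collections.Counter` instead of pandas so that it runs on its own.

```python
import re
from collections import Counter

# Made-up tweets, purely for illustration.
sample_tweets = [
    'I love being a @UMich student',
    'RT @UMich: Go Blue! cc @practicaldsc',
    'no mentions in this one',
]

# Capture the alphanumeric name after each '@', across all tweets.
mentions = [name
            for tweet in sample_tweets
            for name in re.findall(r'@(\w+)', tweet)]

mention_counts = Counter(mentions)
print(mention_counts.most_common(2))
```

On the real data, the same pattern could be passed to `tweets['text'].str.findall`, as was done for hashtags. Note that `@(\w+)` would also pick up the domain in an email address like `'billy@umich.edu'`, so a stricter pattern may be needed.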

<div class="alert alert-danger">
    
#### Reference Slide

### Example: Log parsing

- Web servers typically record every request made of them in the "logs".

In [None]:
s = '''132.249.20.188 - - [01/Oct/2024:02:36:15 -0400] "GET /my/home/ HTTP/1.1" 200 2585'''

- Let's use our new regex syntax (including capturing groups) to extract the day, month, year, and time from the log string `s`.

In [None]:
exp = r'\[(.+)\/(.+)\/(.+):(.+):(.+):(.+) .+\]'
re.findall(exp, s)

- While the above regex works, it is not very **specific**: it also _works_ on incorrectly formatted log strings.

In [None]:
other_s = '[adr/jduy/wffsdffs:r4s4:4wsgdfd:asdf 7]'
re.findall(exp, other_s)

<div class="alert alert-danger">
    
#### Reference Slide

### The more specific, the better!    

- Be as specific in your pattern matching as possible – you don't want to match and extract strings that don't fit the pattern you care about.<br><small>`.*` matches every possible string, but we don't use it very often.</small>

- A better date extraction regex:
    ```
    \[(\d{2})\/([A-Z]{1}[a-z]{2})\/(\d{4}):(\d{2}):(\d{2}):(\d{2}) -\d{4}\]
    ```
    - `\d{2}` matches any 2-digit number.
    - `[A-Z]{1}` matches any single occurrence of any uppercase letter.
    - `[a-z]{2}` matches any 2 consecutive occurrences of lowercase letters.
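    - Remember, special characters like `[` and `]` need to be escaped with `\`. Python's `re` doesn't require escaping `/`, but doing so is harmless, and some tools (such as regex101 with `/` as the delimiter) do require it.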

In [None]:
s

In [None]:
new_exp = r'\[(\d{2})\/([A-Z]{1}[a-z]{2})\/(\d{4}):(\d{2}):(\d{2}):(\d{2}) -\d{4}\]'
re.findall(new_exp, s)

- A benefit of `new_exp` over `exp` is that it doesn't capture anything when the string doesn't follow the format we specified.

In [None]:
other_s

In [None]:
re.findall(new_exp, other_s)
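
A possible refinement, sketched under the assumption that every field is zero-padded as in the pattern above: Python's `re` supports **named groups**, written `(?P<name>...)`, which let us label each captured piece. The variable names and the sample `log_line` below are hypothetical.

```python
import re

# A hypothetical log line with a zero-padded, two-digit hour.
log_line = ('132.249.20.188 - - [01/Oct/2024:02:36:15 -0400] '
            '"GET /my/home/ HTTP/1.1" 200 2585')

# Same structure as the specific pattern above, but each group has a name.
named_exp = (r'\[(?P<day>\d{2})/(?P<month>[A-Z][a-z]{2})/(?P<year>\d{4})'
             r':(?P<hour>\d{2}):(?P<minute>\d{2}):(?P<second>\d{2}) -\d{4}\]')

# re.search returns a match object for the first match, or None if there is none.
match = re.search(named_exp, log_line)
fields = match.groupdict() if match else {}
print(fields)

# The badly formatted string is rejected, so there is nothing to extract.
bad_match = re.search(named_exp, '[adr/jduy/wffsdffs:r4s4:4wsgdfd:asdf 7]')
print(bad_match)
```

`groupdict()` returns the captured pieces as a dictionary keyed by group name, which is often easier to work with than a tuple of positions.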