String methods such as `split()` are useful for extracting portions of a string.

In [1]:
string = '9 13 and 15 are odd numbers.'
string.split()

['9', '13', 'and', '15', 'are', 'odd', 'numbers.']

This could be used to find the numbers in a string.

In [2]:
string = '9 13 and 15 are odd numbers.'
numbers = []
for item in string.split():
    if item.isdigit():
        numbers.append(item)
print(numbers)

['9', '13', '15']


But input data is typically not this simple. Let's run the solution above on the same string with a comma between 9 and 13.

In [3]:
string = '9, 13 and 15 are odd numbers.'
numbers = []
for item in string.split():
    if item.isdigit():
        numbers.append(item)
print(numbers)

['13', '15']


`'9'` is missing because `split()` returns `'9,'` instead of `'9'`.

In [4]:
string = '9, 13 and 15 are odd numbers.'
string.split()[0]

'9,'

In [5]:
'9,'.isdigit()

False

There are many ways to solve this issue. One way is to remove commas from the string.

In [6]:
string = '9, 13 and 15 are odd numbers.'
string_wo_comma = string.replace(',', '')
numbers = []
for item in string_wo_comma.split():
    if item.isdigit():
        numbers.append(item)
print(numbers)

['9', '13', '15']
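Another string-method approach, sketched here as an illustration rather than as part of the original solution, is to strip punctuation from both ends of each item with `str.strip()` and `string.punctuation` before calling `isdigit()`. The variable name `text` is used (instead of `string`) only to avoid clashing with the standard `string` module.

```python
import string  # standard library module that provides string.punctuation

text = '9, $13 and (15) are odd numbers.'
numbers = []
for item in text.split():
    # Strip leading/trailing punctuation such as ',', '$', '(' and ')'
    cleaned = item.strip(string.punctuation)
    if cleaned.isdigit():
        numbers.append(cleaned)
print(numbers)  # ['9', '13', '15']
```

This works for punctuation at the edges of an item, but every new kind of input still needs its own special handling.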


But as more complexities arise, it becomes harder to handle every case with string methods alone. Searching and extracting is such a common task that Python ships with a very powerful
module for *regular expressions*, `re`, that handles many of these tasks quite elegantly.

https://docs.python.org/3/library/re.html

Let's start by importing the module `re`.

In [7]:
import re

Let's solve the above task using regular expressions.

In [8]:
string = '9, 13 and 15 are odd numbers.'
pattern = '[0-9]+'
re.findall(pattern, string)

['9', '13', '15']

No `for` loops, no `if` statements and it works even when you add complexity.

In [9]:
# Find the numbers in a string with regular expressions
string = "9, $13 and (15) are odd numbers."
pattern = '[0-9]+'
re.findall(pattern, string)

['9', '13', '15']

This is the essence of regular expressions. We define a `pattern` and search that pattern inside a `string`.

There are multiple functions in Python's regular expressions library. We will mostly use `re.findall()` in this DataLab but be aware of the others:

https://docs.python.org/3/library/re.html#functions
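As a rough orientation, here is a minimal sketch of a few of those other functions applied to the same example string. The variable names (`first`, `at_start`, `replaced`, `parts`) are chosen only for this illustration.

```python
import re

string = '9, 13 and 15 are odd numbers.'
pattern = '[0-9]+'

# re.search() returns a match object for the first match anywhere in the string (or None)
first = re.search(pattern, string)
print(first.group())     # 9

# re.match() only succeeds if the pattern matches at the very beginning of the string
at_start = re.match(pattern, string)
print(at_start.group())  # 9

# re.sub() replaces every match with the replacement text
replaced = re.sub(pattern, 'N', string)
print(replaced)          # N, N and N are odd numbers.

# re.split() cuts the string wherever the pattern matches
parts = re.split(pattern, string)
print(parts)             # ['', ', ', ' and ', ' are odd numbers.']
```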

**But what is the meaning of the pattern `[0-9]+`?**

Regular expressions are very powerful, but they are a little complicated and their syntax takes
some getting used to. They are almost their own little programming language for searching and parsing strings.

Regular expressions are built from special characters that **match** the ordinary characters in a string. For example, `\s` matches a whitespace character and `.` matches any single character (except a newline, by default). Therefore `\s...\s` matches a sequence of three characters surrounded by whitespace, e.g. a three-letter word.

`[0-9]+` will be explained later.

In [10]:
string = 'I am looking for three letter words.'
pattern = r'\s...\s'  # raw string, so Python passes the backslashes to re unchanged
re.findall(pattern, string)

[' for ']

Here is a look at character level matches:

| | | | | | |
|:-:|:-:|:-:|:-:|:-:|:-:|
| | | | | | |
|Pattern|`\s` |`.` |`.` |`.` |`\s` |
||&#8595;|&#8595;|&#8595;|&#8595;|&#8595;|
|String|` ` | `f`|`o` |`r` |` ` |


But if the word is at the beginning or the end, it will not match.

In [11]:
string = 'But if the word is at the beginning...'
pattern = r'\s...\s'
re.findall(pattern, string)

[' the ', ' the ']

Notice that `'But '` is not a match. Also notice that `re.findall()` can return multiple matches.

Here is a character-level look to see why it does not match "But":

| | | | | | |
|:-:|:-:|:-:|:-:|:-:|:-:|
| | | | | | |
|Pattern|`\s` |`.` |`.` |`.` |`\s` |
||&#10060;|&#8595;|&#8595;|&#8595;|&#8595;|
|String|| `B`|`u` |`t` |` ` |


In [12]:
string = 'It will also match numbers like 123 but not 456'
pattern = r'\s...\s'
re.findall(pattern, string)

[' 123 ', ' not ']

| | | | | | |
|:-:|:-:|:-:|:-:|:-:|:-:|
| | | | | | |
|Pattern|`\s` |`.` |`.` |`.` |`\s` |
||&#8595;|&#8595;|&#8595;|&#8595;|&#10060;|
|String| ` `|`4` |`5` |`6` ||

There are many more special characters. There is no need to memorize them; you can use cheat sheets such as the one below. It is not exhaustive, but it covers the most common operations.

**Regular Expressions Cheatsheet**

| Character | Description |
| --- | --- |
|^|Matches the beginning of a line|
|$|Matches the end of the line|
|.|Matches any character|
|\s|Matches whitespace|
|\S|Matches any non-whitespace character|
|\*|Repeats a character zero or more times|
|*?|Repeats a character zero or more times (non-greedy)|
|\+|Repeats a character one or more times|
|+?|Repeats a character one or more times (non-greedy)|
|[aeiou]|Matches a single character in the listed set|
|[^XYZ]|Matches a single character not in the listed set|
|[a-z0-9]|The set of characters can include a range|
|(|Indicates where string extraction is to start|
|)|Indicates where string extraction is to end|

Additionally, there are tools that explain what a given expression matches. These tools are very useful when working with regular expressions.

**Task 1**
 - Go to https://regexr.com/
 - Enter the following regular expression:
         \s...\s
 - Enter the following text:
         I am looking for three letter words.
 - Check the explanation
 - Roll-over elements in the explanation to highlight in the expression above
 - Click on the elements to open them in reference on the left
 - **Use this tool from now on when you are in doubt**

Let's continue exploring more characters. In the previous example, you might have noticed that the surrounding whitespace is returned as well.

In [13]:
string = 'I am looking for three letter words.'
pattern = r'\s...\s'
re.findall(pattern, string)

[' for ']

But we just want the word. We can wrap the section we want to be returned into parentheses.

Parentheses are another special character in regular expressions. When you add
parentheses to a regular expression, they are ignored when matching the string.
But when you are using `re.findall()`, parentheses indicate that while you want the
whole expression to match, you only are interested in extracting a portion of the
substring that matches the regular expression.

In [14]:
string = 'I am looking for three letter words.'
pattern = r'\s(...)\s'
re.findall(pattern, string)

['for']
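A pattern may contain more than one pair of parentheses. In that case `re.findall()` returns a list of tuples, one tuple per match and one element per group. The example string and the name `pairs` below are made up for this illustration.

```python
import re

string = 'width=120 height=80'
# Two groups: the name before '=' and the number after it
pattern = '([a-z]+)=([0-9]+)'
pairs = re.findall(pattern, string)
print(pairs)  # [('width', '120'), ('height', '80')]
```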

Now, let's understand the expression `[0-9]+` we saw previously.

- The first step is to understand sets `[ ]`
- The second step is to understand ranges `[0-9]`
- The third step is to understand the `+` character

**Step 1:** A set matches a single character. For example the set `[0123456789]` matches `0` or `1` or `2` or `3` or `4` or `5` or `6` or `7` or `8` or `9`. It does not
match `0123456789`.

In [15]:
# Find single digits
string = '2, 5 and 7 are prime numbers.'
pattern = '[0123456789]'
re.findall(pattern, string)

['2', '5', '7']

| | | | | | |
|:-:|:-:|:-:|:-:|:-:|:-:|
| | | | | | |
|Pattern|`[0123456789]` |
||&#8595;|
|String| `2`|

What happens if the same pattern is applied to two-digit numbers?

In [16]:
string = '11, 13 and 15 are odd numbers.'
pattern = '[0123456789]'
re.findall(pattern, string)

['1', '1', '1', '3', '1', '5']

Since the pattern `[0123456789]` matches single characters, it will match the digits in `11` but as two separate matches. Therefore the return is `['1', '1', ...]` instead of `['11', ...]`.

To match two-digit numbers we can simply use the pattern twice.

In [17]:
string = '11, 13 and 15 are odd numbers.'
pattern = '[0123456789][0123456789]'
re.findall(pattern, string)

['11', '13', '15']

| | | | | | |
|:-:|:-:|:-:|:-:|:-:|:-:|
| | | | | | |
|Pattern|`[0123456789]` |`[0123456789]` |
||&#8595;|&#8595;|
|String| `1`|`1`|

You are probably thinking that `[0123456789]` is quite verbose. What if we want to match all letters? Do we have to write `[abcdefghijklmnopqrstuvwxyz]`?

**Step 2:** That would be a nightmare but luckily we can define a range for common sets. 
- `[0123456789]` can be represented as `[0-9]`
- Lower case letters `[a-z]`
- Upper case letters `[A-Z]`

In [18]:
# Single digit with range
string = '2, 5 and 7 are prime numbers.'
pattern = '[0-9]'
re.findall(pattern, string)

['2', '5', '7']

In [19]:
# Two digits with range
string = '11, 13 and 15 are odd numbers.'
pattern = '[0-9][0-9]'
re.findall(pattern, string)

['11', '13', '15']

**Task 2**

Here is your second task. Define a pattern that returns lower case and upper case letters in a string.

Example string: `'Word1 woRd2'`

Output: `['W', 'o', 'r', 'd', 'w', 'o', 'R', 'd']`

In [29]:
string = 'Word1 woRd2'
pattern = '[a-zA-Z]'
re.findall(pattern, string)

['W', 'o', 'r', 'd', 'w', 'o', 'R', 'd']

**Step 3:** But what if there are single and two digit numbers in one string?

In [30]:
# This won't work
string = "9, $13 and (15) are odd numbers."
pattern = '[0-9]'
re.findall(pattern, string)

['9', '1', '3', '1', '5']

In [31]:
# This won't work either
string = "9, $13 and (15) are odd numbers."
pattern = '[0-9][0-9]'
re.findall(pattern, string)

['13', '15']

The `+` character repeats the preceding character or set one or more times.

Therefore `[0-9]+` will match
- `[0-9]`
- `[0-9][0-9]`
- `[0-9][0-9][0-9]`
- `[0-9][0-9][0-9][0-9]`
- ...

In [32]:
string = "9, $13 and (15) are odd numbers."
pattern = '[0-9]+'
re.findall(pattern, string)

['9', '13', '15']
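The cheat sheet also lists `*`, which repeats zero or more times. A small sketch of the difference, using a made-up string so that the effect is easy to see:

```python
import re

string = 'a ab abb'

# 'ab+' requires at least one 'b' after the 'a'
one_or_more = re.findall('ab+', string)
print(one_or_more)   # ['ab', 'abb']

# 'ab*' also accepts an 'a' with no 'b' at all
zero_or_more = re.findall('ab*', string)
print(zero_or_more)  # ['a', 'ab', 'abb']
```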

Sets can also be used to define characters that you don't want to match:

| Character | Description |
| --- | --- |
|[aeiou]|Matches a single character in the listed set|
|[^XYZ]|Matches a single character not in the listed set|
|[a-z0-9]|The set of characters can include a range|

For example, `[^0-9]` matches every character that is not a digit:

In [33]:
string = "123 abc 5d"
pattern = '[^0-9]'
re.findall(pattern, string)

[' ', 'a', 'b', 'c', ' ', 'd']

**Task 3:** Find the number of apples in a string using regular expressions. The phrase `'x apples'`, where `x` is a number, can appear anywhere in the string.

---

Example string
`'There are 15 apples in the basket.'`

Expected output is `['15']`

---

Example string
`'There are 15 apples and 20 oranges in the basket.'`

Expected output is `['15']`

---

Example string
`'5 apples here 10 apples there.'`

Expected output is `['5', '10']`

---

Example string
`'There is only 1 apple left.'`

Expected output is `['1']`

---
For the following case, output can be an empty list.

Example string
`'I have an apple'`

Expected output is `[]`



In [83]:
string = 'There are 15 apples in the basket.'
pattern = "([0-9]+) apple"
re.findall(pattern, string)

['15']

In [82]:
string = 'There are 15 apples and 10 oranges in the basket.'
pattern = "([0-9]+) apple"
re.findall(pattern, string)

['15']

In [81]:
string = '5 apples here, 10 apples there.'
pattern = "([0-9]+) apple"
re.findall(pattern, string)

['5', '10']

In [79]:
string = 'There is only 1 apple left.'
pattern = "([0-9]+) apple"
re.findall(pattern, string)

['1']

In [80]:
string = 'I have an apple.'
pattern = "([0-9]+) apple"
re.findall(pattern, string)

[]

**Beginning/end of a string**

Sometimes we are interested in matching the beginning or the end of a string. There are special characters for that as well.

| Character | Description |
| --- | --- |
|^|Matches the beginning of a line|
|$|Matches the end of the line|

In [34]:
# Find the number at the beginning of the string 
string = '1-There are 5 apples and 2 oranges in the basket.'
pattern = '^[0-9]+'
re.findall(pattern, string)

['1']
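The `$` character works the same way for the end of the string. A minimal sketch with made-up strings:

```python
import re

# Find the number at the end of the string
string = 'There are 5 apples and the total bill is 12'
at_end = re.findall('[0-9]+$', string)
print(at_end)      # ['12']

# No match here: the last character is '.', not a digit
string = 'There are 5 apples and the total bill is 12.'
not_at_end = re.findall('[0-9]+$', string)
print(not_at_end)  # []
```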

**Escape character**

We use special characters in regular expressions such as a `.`. But what if we would like to match a `.` in a string? We need a way to indicate that these characters are “normal” and we want to match the actual character in a string.

We can indicate that we want to simply match a character by prefixing that character with a backslash.

In [36]:
# Find all abbreviated titles
string = "Dr. A, Ms. B, Mr. C"
pattern = '...' # this won't work
re.findall(pattern, string)

['Dr.', ' A,', ' Ms', '. B', ', M', 'r. ']

In [37]:
# Find all abbreviated titles
string = "Dr. A, Ms. B, Mr. C"
pattern = r'..\.'  # this will work
re.findall(pattern, string)

['Dr.', 'Ms.', 'Mr.']
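When the text to search for comes from a variable, the standard library function `re.escape()` can add the backslashes for you. The sketch below uses a made-up string to compare an unescaped and an escaped search term.

```python
import re

string = 'Version 3.14 is not the same as version 3x14.'

# Unescaped: '.' matches any character, so '3x14' is found as well
unescaped = re.findall('3.14', string)
print(unescaped)  # ['3.14', '3x14']

# Escaped: re.escape() turns '3.14' into a pattern that matches a literal '.'
escaped = re.findall(re.escape('3.14'), string)
print(escaped)    # ['3.14']
```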

**\+ and * are greedy**

\+ and * repeat a character. But it is crucial to understand that they are greedy: they match as many characters as possible. Let's examine this behaviour with an example.

Let's say you have the string

`"From: stephen.marquard@uct.ac.za, csev@umich.edu, and cwen @iupui.edu"`

and you would like to get the name of the first person after `'From:'`.

We can define a pattern such as
`'From:\s(.+)@'`

expecting it to give us the name between `From:` and `@`. Let's see what happens.

In [38]:
# + will push until the last @ sign
string = "From: stephen.marquard@uct.ac.za, csev@umich.edu, and cwen @iupui.edu"
pattern = r'From:\s(.+)@'
re.findall(pattern, string)

['stephen.marquard@uct.ac.za, csev@umich.edu, and cwen ']

We did not get the first name. But this is not because the pattern is not matching. It is because of the greedy behaviour of the `+` character. If we use `+?`, which is non-greedy, it will give us what we want.

In [39]:
# you can use +? for a non-greedy repetition
string = "From: stephen.marquard@uct.ac.za, csev@umich.edu, and cwen @iupui.edu"
pattern = r'From:\s(.+?)@'
re.findall(pattern, string)

['stephen.marquard']
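Another way to get the same result, shown here as an alternative sketch, is to keep the greedy `+` but restrict what it may repeat. The negated set `[^@]` matches any character except `@`, so the repetition cannot run past the first `@` sign.

```python
import re

string = "From: stephen.marquard@uct.ac.za, csev@umich.edu, and cwen @iupui.edu"
# [^@]+ is greedy, but it stops at the first '@' because '@' is excluded from the set
pattern = r'From:\s([^@]+)@'
names = re.findall(pattern, string)
print(names)  # ['stephen.marquard']
```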

If you read the `re.findall()` documentation it says the following:

"Return all **non-overlapping matches** of pattern in string, as a list of strings or tuples. The string is **scanned left-to-right**, and matches are returned in the order found. Empty matches are included in the result."

The fact that `re.findall()` reads left-to-right and finds non-overlapping matches has important implications.

Take a look at the example below:

In [40]:
string = "123456"
pattern = '...'
re.findall(pattern, string)

['123', '456']

`'...'` also matches `'234'`, but it is not returned. Why?

This is because:

- `re.findall()` reads from left-to-right,
- finds the first match (`'123'` in this example),
- continues scanning from the next character after the match (`'4'` in this example),
- therefore "non-overlapping".
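If overlapping matches are needed, one option (a sketch, not something `re.findall()` does for you) is to try the pattern at every start position yourself. A compiled pattern's `match()` method accepts a start index as its second argument.

```python
import re

string = '123456'
pattern = re.compile('...')

overlapping = []
for start in range(len(string)):
    # Try to match exactly at index `start`; returns None if fewer than 3 characters remain
    match = pattern.match(string, start)
    if match:
        overlapping.append(match.group())
print(overlapping)  # ['123', '234', '345', '456']
```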

**Task 4:** Given an arithmetic operation with nested parentheses, return the innermost parentheses and its contents.

---

Example string `"(5 * (3 + 2)) - 7"`

Expected output is `['(3 + 2)']`

--- 

Example string `"((7 - 2) * (1 + 2)) / 2"`

Expected output is `['(7 - 2)', '(1 + 2)']`

In [3]:
import re

In [6]:
string = "((7 - 2) * (1 + 2)) / 2"
pattern = r".[0-9]\s[+-]\s[0-9]."
re.findall(pattern, string)

['(7 - 2)', '(1 + 2)']

In [7]:
string = "(5 * (3 + 2)) - 7"
pattern = r".[0-9]\s[+-]\s[0-9]."
re.findall(pattern, string)

['(3 + 2)']
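The pattern above relies on single digits and on the `+` and `-` operators. A more general sketch, offered as one possible alternative, escapes the parentheses and uses a negated set: an innermost pair is a `(`, followed by any characters that are not parentheses, followed by a `)`. The third test string is made up for this illustration.

```python
import re

# \( and \) match literal parentheses; [^()]* matches anything except parentheses
pattern = r'\([^()]*\)'

first = re.findall(pattern, "(5 * (3 + 2)) - 7")
second = re.findall(pattern, "((7 - 2) * (1 + 2)) / 2")
third = re.findall(pattern, "(10 / (4 * 25)) + 1")  # multi-digit numbers and '*'

print(first)   # ['(3 + 2)']
print(second)  # ['(7 - 2)', '(1 + 2)']
print(third)   # ['(4 * 25)']
```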

We have covered the fundamentals of regular expressions. But there are many more characters. It is helpful to skim a cheatsheet and see what is possible.

https://www.dataquest.io/wp-content/uploads/2019/03/python-regular-expressions-cheat-sheet.pdf

Now to practice regular expressions, please continue with the following tasks.

# Task 5

Given a list of tweets, remove links, hashtags and user handles from the tweets. Tweet processing will be essential for the creative brief. For this task, check the documentation for `re.sub()`.

Example tweet:

`'@BhaktisBanter @PallaviRuhail This one is irresistible :)\n#FlipkartFashionFriday http://t.co/EbZ0L2VENM'`

Expected output:

`'This one is irresistible :)\nFlipkartFashionFriday'`


For this task you will use a sample Twitter dataset from the NLTK library.

In [8]:
import nltk
from nltk.corpus import twitter_samples
nltk.download('twitter_samples')
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')
tweets = all_positive_tweets + all_negative_tweets

[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\neilr\AppData\Roaming\nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


In [9]:
# Examine the first 10 tweets
tweets[:10]

['#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)',
 '@Lamb2ja Hey James! How odd :/ Please call our Contact Centre on 02392441234 and we will be able to assist you :) Many thanks!',
 '@DespiteOfficial we had a listen last night :) As You Bleed is an amazing track. When are you in Scotland?!',
 '@97sides CONGRATS :)',
 'yeaaaah yippppy!!!  my accnt verified rqst has succeed got a blue tick mark on my fb profile :) in 15 days',
 '@BhaktisBanter @PallaviRuhail This one is irresistible :)\n#FlipkartFashionFriday http://t.co/EbZ0L2VENM',
 "We don't like to keep our lovely customers waiting for long! We hope you enjoy! Happy Friday! - LWWF :) https://t.co/smyYriipxI",
 '@Impatientraider On second thought, there’s just not enough time for a DD :) But new shorts entering system. Sheep must be buying.',
 'Jgh , but we have to go to Bayan :D bye',
 'As an act of mischievousness, am calling the ETL layer of our in-house warehousing 

In [17]:
string = '@BhaktisBanter @PallaviRuhail This one is irresistible :)\n#FlipkartFashionFriday http://t.co/EbZ0L2VENM'

# Remove links
processed_tw = re.sub(r'http\S+', '', string)
# Remove handles
processed_tw = re.sub(r'@[A-Za-z0-9_]+', '', processed_tw)
# Remove hashtag signs (the hashtag text is kept)
processed_tw = re.sub('#', '', processed_tw)

processed_tw.strip()

'This one is irresistible :)\nFlipkartFashionFriday'
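Since the task asks for a list of tweets, the same three substitutions can be wrapped in a function and applied with a list comprehension. This is a minimal sketch: `clean_tweet` and `sample_tweets` are names made up for the example, and the two sample tweets are copied from the output above so that the block runs without downloading the dataset.

```python
import re

def clean_tweet(tweet):
    """Remove links, user handles and '#' signs from a single tweet."""
    tweet = re.sub(r'http\S+', '', tweet)         # links
    tweet = re.sub(r'@[A-Za-z0-9_]+', '', tweet)  # user handles
    tweet = re.sub('#', '', tweet)                # drop '#', keep the hashtag text
    return tweet.strip()

sample_tweets = [
    '@97sides CONGRATS :)',
    '@BhaktisBanter @PallaviRuhail This one is irresistible :)\n#FlipkartFashionFriday http://t.co/EbZ0L2VENM',
]

cleaned = [clean_tweet(tweet) for tweet in sample_tweets]
print(cleaned)
```

With the full dataset loaded, the same comprehension could be run over `tweets` instead of `sample_tweets`.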

# Task 6

Find all the email addresses inside the file `'mbox-short.txt'`. It is a collection of email messages and metadata.

If you would like to know more about mbox files, read the following:
https://en.wikipedia.org/wiki/Mbox

In [10]:
# Let's examine the first 10 lines
with open('mbox-short.txt') as f:
    for counter, line in enumerate(f):
        if counter >= 10:
            break
        print(line)

From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008

Return-Path: <postmaster@collab.sakaiproject.org>

Received: from murder (mail.umich.edu [141.211.14.90])

	 by frankenstein.mail.umich.edu (Cyrus v2.3.8) with LMTPA;

	 Sat, 05 Jan 2008 09:14:16 -0500

X-Sieve: CMU Sieve 2.3

Received: from murder ([unix socket])

	 by mail.umich.edu (Cyrus v2.2.12) with LMTPA;

	 Sat, 05 Jan 2008 09:14:16 -0500

Received: from holes.mr.itd.umich.edu (holes.mr.itd.umich.edu [141.211.14.79])



Designing a pattern that matches email addresses requires knowledge of their syntax rules. The format of an email address is `local-part@domain`. The syntax rules for the local-part and the domain differ from each other and are complex.

For example the following is a valid email address:

`"very.(),:;<>[]\".VERY.\"very@\\ \"very\".unusual"@strange.example.com`

You can find all the rules here:
https://en.wikipedia.org/wiki/Email_address#Syntax

**For this task assume that a valid email address can contain:**

- lowercase Latin letters `a` to `z`
- digits `0` to `9`
- dot `.`

and nothing else. You should be able to find a total of 305 email addresses, 16 of which are unique.

|id|email|count|
|---|---|---|
|0|`source@collab.sakaiproject.org`|135|
|1|`postmaster@collab.sakaiproject.org`|27|
|2|`apache@localhost`|27|
|3|`cwen@iupui.edu`|20|
|4|`zqian@umich.edu`|17|
|5|`david.horwitz@uct.ac.za`|17|
|6|`louis@media.berkeley.edu`|12|
|7|`gsilver@umich.edu`|12|
|8|`stephen.marquard@uct.ac.za`|8|
|9|`rjlowe@iupui.edu`|8|
|10|`wagnermr@iupui.edu`|6|
|11|`antranig@caret.cam.ac.uk`|4|
|12|`gopal.ramasammycook@gmail.com`|4|
|13|`ray@media.berkeley.edu`|4|
|14|`hu2@iupui.edu`|2|
|15|`josrodri@iupui.edu`|2|
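As a starting point, the simplified rules above translate directly into a set: `[a-z0-9.]` on both sides of the `@` sign. The sketch below tries this on three sample lines copied from the file preview, not on `mbox-short.txt` itself, so it makes no claim about reproducing the counts in the table; other lines in the real file may need a more careful pattern.

```python
import re

# A few lines copied from the preview of mbox-short.txt
sample = (
    'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008\n'
    'Return-Path: <postmaster@collab.sakaiproject.org>\n'
    'Received: from murder (mail.umich.edu [141.211.14.90])\n'
)

# Lowercase letters, digits and dots on both sides of '@' (the task's assumption)
pattern = '[a-z0-9.]+@[a-z0-9.]+'
found = re.findall(pattern, sample)
print(found)  # ['stephen.marquard@uct.ac.za', 'postmaster@collab.sakaiproject.org']
```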


In [20]:
import re
from prettytable import PrettyTable

# Define the pattern for matching email addresses
email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

# Open the file and read its contents
with open('mbox-short.txt', 'r') as f:
    content = f.read()

# Find all email addresses in the content using the defined pattern
emails = re.findall(email_pattern, content)

# Count occurrences of each unique email address
email_counts = {}
for email in emails:
    email_counts[email] = email_counts.get(email, 0) + 1

# Create a PrettyTable instance
table = PrettyTable(['id', 'email', 'count'])

# Add data to the table
for idx, (email, count) in enumerate(email_counts.items()):
    table.add_row([idx, email, count])

# Print the table
print(table)

+----+-----------------------------------------------------+-------+
| id |                        email                        | count |
+----+-----------------------------------------------------+-------+
| 0  |              stephen.marquard@uct.ac.za             |   8   |
| 1  |          postmaster@collab.sakaiproject.org         |   27  |
| 2  | 200801051412.m05ECIaH010327@nakamura.uits.iupui.edu |   1   |
| 3  |            source@collab.sakaiproject.org           |  135  |
| 4  |               louis@media.berkeley.edu              |   12  |
| 5  | 200801042308.m04N8v6O008125@nakamura.uits.iupui.edu |   1   |
| 6  |                   zqian@umich.edu                   |   17  |
| 7  | 200801042109.m04L92hb007923@nakamura.uits.iupui.edu |   1   |
| 8  |                   rjlowe@iupui.edu                  |   8   |
| 9  | 200801042044.m04Kiem3007881@nakamura.uits.iupui.edu |   1   |
| 10 | 200801042001.m04K1cO0007738@nakamura.uits.iupui.edu |   1   |
| 11 | 200801041948.m04JmdwO007705