# Regular Expressions
A regular expression (RegEx) is a sequence of characters that defines a search pattern, written in a specialized syntax, for matching strings or sets of strings. Python has a built-in module called `re` for working with regular expressions. To use it, run `import re`.

## Raw Strings

To keep Python from interpreting backslashes in a RegEx pattern as string escape sequences, prefix the pattern string with `r`, for example `r'\d+'`.
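A minimal sketch of why the prefix matters, using `\b` as the example. The variable names and sample text are illustrative and not part of the original notebook.

```python
import re

# In a normal string, '\b' is the backspace character (a single character).
# In a raw string, r'\b' is a backslash followed by 'b' (two characters),
# which the re module interprets as a word boundary.
normal = '\bcat\b'
raw = r'\bcat\b'

print(len(normal))  # 5
print(len(raw))     # 7

text = 'the cat sat'
print(re.findall(normal, text))  # [] (the text contains no backspace characters)
print(re.findall(raw, text))     # ['cat']
```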

## Regex Cheat Sheet

source: https://regexone.com

For an excellent interactive tutorial, go to https://regexone.com/lesson/introduction_abcs

https://regexr.com/ is also a helpful site for testing patterns and learning more about RegEx.

`abc…`	Letters<br>
`123…`	Digits<br>
`\d`	Any digit<br>
`\D`	Any non-digit character<br>
`.`	Any character (except a newline, by default)<br>
`\.`	Period<br>
`[abc]`	Only a, b, or c<br>
`[^abc]`	Not a, b, or c<br>
`[a-z]`	Characters a to z<br>
`[0-9]`	Numbers 0 to 9<br>
`\w`	Any alphanumeric character (or underscore)<br>
`\W`	Any non-alphanumeric character<br>
`{m}`	m repetitions<br>
`{m,n}`	m to n repetitions<br>
`*`	Zero or more repetitions<br>
`+`	One or more repetitions<br>
`?`	Optional character<br>
`\s`	Any whitespace<br>
`\S`	Any non-whitespace character<br>
`^…$`	Starts and ends<br>
`(…)`	Capture group<br>
`(a(bc))`	Capture sub-group<br>
`(.*)`	Capture all<br>
`(abc|def)`	Matches abc or def<br>
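A few of the cheat-sheet entries in action. This is a small sketch, and the sample sentence is made up purely for illustration.

```python
import re

# Made-up sample sentence, used only for illustration
sample = "Room 101 opens at 9am; room 7 is closed."

# \d+ : one or more digits
numbers = re.findall(r'\d+', sample)
print(numbers)  # ['101', '9', '7']

# [Rr] : either 'R' or 'r'
rooms = re.findall(r'[Rr]oom \d+', sample)
print(rooms)  # ['Room 101', 'room 7']

# {3} : exactly three repetitions
three_digits = re.findall(r'\d{3}', sample)
print(three_digits)  # ['101']

# (...) : capture groups; (am|pm) : either alternative
time_match = re.search(r'(\d+)(am|pm)', sample)
print(time_match.groups())  # ('9', 'am')

# ^...$ : the pattern must span the whole string; \. is a literal period
whole = re.search(r'^Room.*closed\.$', sample)
print(whole is not None)  # True
```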

In [37]:
# Import the built-in Regular Expressions package
import re
import pprint as pp

email_header = "Fire: From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008 Return-Path: <postmaster@collab.sakaiproject.org> for <source@collab.sakaiproject.org>;Received: (from apache@localhost) Author:  stephen.marquard@uct.ac"

# '.+' is greedy, so the match runs from the leading 'F' to the last ':' in the string
found_text = re.findall('^F.+:', email_header)
print(found_text)
print("found text is of type", type(found_text))

# author = re.findall(r'Author:\s+\S+', email_header)
# print(author)

['Fire: From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008 Return-Path: <postmaster@collab.sakaiproject.org> for <source@collab.sakaiproject.org>;Received: (from apache@localhost) Author:']
found text is of type <class 'list'>
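The pattern `^F.+:` matches far more than the first word because `+` is greedy. A short sketch comparing it with the non-greedy form `+?`, run on a shortened copy of the header string (the shortening is only to keep the output readable):

```python
import re

# Shortened copy of the notebook's header string
header = "Fire: From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008"

# Greedy: '.+' grabs as much as possible, so the match ends at the LAST colon
greedy = re.findall(r'^F.+:', header)
print(greedy)  # ['Fire: From stephen.marquard@uct.ac.za Sat Jan 5 09:14:']

# Non-greedy: '.+?' grabs as little as possible, so the match ends at the FIRST colon
lazy = re.findall(r'^F.+?:', header)
print(lazy)  # ['Fire:']
```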


In [39]:
# Import the built-in Regular Expressions package
import re
import pprint as pp

email_header = "Fire: From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008 Return-Path: <postmaster@collab.sakaiproject.org> for <source@collab.sakaiproject.org>;Received: (from apache@localhost) Author:  stephen.marquard@uct.ac"

# Raw strings keep '\d', '\s' and '\S' from being treated as string escapes
found_text = re.findall(r'\d\d:\d\d:\d\d', email_header)
print(found_text)
print("found text is of type", type(found_text))

author = re.findall(r'Author:\s+\S+', email_header)
print(author)

['09:14:16']
found text is of type <class 'list'>
['Author:  stephen.marquard@uct.ac']
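The result above still includes the `Author:` label. One way to return only the address is to wrap that part of the pattern in a capture group; the sketch below assumes a shortened copy of the header string.

```python
import re

# Shortened copy of the notebook's header string
header = "Received: (from apache@localhost) Author:  stephen.marquard@uct.ac"

# With exactly one capture group, findall returns only the captured text
authors = re.findall(r'Author:\s+(\S+)', header)
print(authors)  # ['stephen.marquard@uct.ac']

# re.search returns a match object: group(0) is the whole match,
# group(1) is the first capture group
match = re.search(r'Author:\s+(\S+)', header)
if match:
    print(match.group(0))  # Author:  stephen.marquard@uct.ac
    print(match.group(1))  # stephen.marquard@uct.ac
```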


In [None]:
# Print every line that starts with 'F', followed by any 2 characters, followed by 'm:'
with open("files/mbox-short.txt", "r") as mboxfile:
    for line in mboxfile:
        line = line.rstrip()
        # '^' anchors the match at the start of the line;
        # without it, re.search would match 'F..m:' anywhere in the line
        if re.search(r'^F..m:', line):
            print(line)
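`re.search` scans the whole line, so whether a pattern is anchored changes which lines match. A small sketch on made-up lines (they stand in for file contents and are not taken from `mbox-short.txt`):

```python
import re

# Made-up lines for illustration only
lines = [
    "From: stephen.marquard@uct.ac.za",
    "Subject: Re: From: the archives",
    "Form: not a header",
]

# Unanchored: 'F..m:' may appear anywhere in the line
anywhere = [line for line in lines if re.search(r'F..m:', line)]
print(len(anywhere))  # 3

# Anchored with '^': the line must begin with the pattern
at_start = [line for line in lines if re.search(r'^F..m:', line)]
print(at_start)  # ['From: stephen.marquard@uct.ac.za', 'Form: not a header']

# re.match only ever tries the start of the string, so it agrees with '^'
with_match = [line for line in lines if re.match(r'F..m:', line)]
print(with_match == at_start)  # True
```

Note that `Form:` matches as well, because each `.` accepts any character.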

In [None]:
# Store all email addresses in a list (for a deep treatment of matching emails with RegEx, see https://www.regular-expressions.info/email.html)
all_emails_list = []
with open("files/mbox-short.txt", "r") as mboxfile:
    for line in mboxfile:
        line = line.rstrip()
        # A simpler but looser alternative: re.findall(r'\S+@\S+\.\D\D\D', line)
        x = re.findall(r'[A-Za-z0-9._%+-]+@(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}', line)
        if x:
            all_emails_list.extend(x)
print(len(all_emails_list))
pp.pprint(all_emails_list)  # Many duplicate emails
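To see what the stricter email pattern buys, the sketch below runs it next to a loose `\S+@\S+` pattern on one sample string assembled from fragments of the header used earlier. The variable names are illustrative.

```python
import re

text = ("From stephen.marquard@uct.ac.za to "
        "<source@collab.sakaiproject.org>; (from apache@localhost)")

# Loose: any non-whitespace run around '@', so punctuation is swept in
loose = re.findall(r'\S+@\S+', text)
print(loose)
# ['stephen.marquard@uct.ac.za', '<source@collab.sakaiproject.org>;', 'apache@localhost)']

# Stricter: limited character classes and at least one dot in the domain.
# (?:...) is a non-capturing group, so findall still returns the full match.
strict = re.findall(r'[A-Za-z0-9._%+-]+@(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}', text)
print(strict)
# ['stephen.marquard@uct.ac.za', 'source@collab.sakaiproject.org']
```

The stricter pattern drops `apache@localhost` because that domain contains no dot; whether that is desirable depends on the data.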


In [None]:
mystring = 'hello, a my friend'
mystring.title()

In [None]:
# Remove duplicate addresses (a set does not preserve the original order)
list(set(all_emails_list))
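Converting to a set discards both the original order and the number of occurrences. If either matters, the standard library offers alternatives. A sketch with placeholder addresses (not taken from the mbox file):

```python
from collections import Counter

# Placeholder addresses for illustration only
emails = ['a@x.org', 'b@y.org', 'a@x.org', 'c@z.org', 'a@x.org']

# dict keys are unique and keep insertion order, so this de-duplicates in order
unique_in_order = list(dict.fromkeys(emails))
print(unique_in_order)  # ['a@x.org', 'b@y.org', 'c@z.org']

# Counter tallies how often each address occurs
counts = Counter(emails)
print(counts.most_common(1))  # [('a@x.org', 3)]
```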

In [None]:
# Collect every occurrence of 'rev=' followed by any five characters
all_revs_list = []
with open("files/mbox-short.txt", "r") as mboxfile:
    for line in mboxfile:
        line = line.rstrip()
        x = re.findall('rev=.....', line)
        if x:
            all_revs_list.extend(x)

print(all_revs_list)
all_revs_set = set(all_revs_list)
print(len(all_revs_set))
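The pattern `rev=.....` assumes the value is exactly five characters long. A sketch of a more flexible variant that captures however many digits follow `rev=`; the sample lines and the `example.org` URLs are made up for illustration.

```python
import re

# Made-up lines for illustration only
lines = [
    "Details: http://example.org/viewsvn/?view=rev&rev=39772",
    "New Revision: 39772",
    "Details: http://example.org/viewsvn/?view=rev&rev=123",
]

# Fixed width: needs exactly five characters after 'rev=', so 'rev=123' is missed
fixed = [m for line in lines for m in re.findall(r'rev=.....', line)]
print(fixed)  # ['rev=39772']

# Capture group with \d+ : any number of digits, and only the digits are returned
revs = [int(m) for line in lines for m in re.findall(r'rev=(\d+)', line)]
print(revs)  # [39772, 123]
```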