# Regex with Python

This notebook introduces the use of regular expressions in Python. The example
is structured around using regular expressions to match American-style dates
in filenames (that is, month-day-year), reorder those into a standard format, and
rename the files.

## Python libraries/modules used

* regular expressions (`re`)
* interacting with the filesystem (using the `os` module)


## Key points / Questions

* How do I implement regular expressions in Python?
* How can I create and reuse a regular expression?
* What are regular expression groups?
* How do I use a regular expression to match a string?
* Can I use groups to extract and modify various data points? 

## Setup

First import `re`, the module for regular expressions.

In [1]:
import re

Now, create some test data to use later. This is a long string with various common patterns in it, like phone numbers, email addresses, URLs, subject headings, and filenames.

In [2]:
# test string
test_string = """My number is 734-764-1817.
Or (734) 764-1817 (phone)
734 746-1817 (fax)
123-567-1839
+1 123-567-1839
(202) 388-6389

librarian@library.org
marian@rivercity-library.net
librarian@library-carpentry.net
superman@sheroes.net
superwoman@sheroes.net

http://www.librarycarpentry.org
http://www.library-carpentry.net 
https://www2.librarycarpentry.com
http://disciplineoforganizing.org
http://subdomain.ebook-central.edu
http://library.com
http://www.google.com

Computer programming
Libraries--Computer programs--Customizing

doge.jpg
doge.jpeg
doge.JPEG
doge.jpeg2000
doge.doc
doge.docx
doge.htm
doge.html
doge.HTML"""

## Create a regex pattern

Using the `re` module, let's try some regexes with the test string we have. To set up a regex in Python, use the `re.compile()` function.

Let's look for phone numbers first. What's the pattern?

_Hint:_ 3 digits, 3 digits, 4 digits; but there can also be some special cases, like parentheses around the area code and hyphens or spaces between the parts.

In [3]:
phone_number_regex = re.compile(r"\(?\d{3}\)?-? ?\d{3}-\d{4}")

## Match a regex pattern

Now, there are a few ways that we can reuse that: 

* `.search()` - returns the first match it finds
* `.findall()` - returns all matches it finds; if there are no groups, it returns a list of the matched strings; if there is one group, it returns a list of that group's strings; if there are two or more groups, it returns a list of tuples
* `.finditer()` - returns an iterator of match objects, that is, something we can loop through like a list

In the examples below, `mo` is used as a common variable indicating "match object", though this is like any other variable and you can name it what you want. You may see this abbreviated to `m` in the Python documentation or other projects.

In [4]:
mo = re.search(phone_number_regex, test_string)

print(mo)

<re.Match object; span=(13, 25), match='734-764-1817'>


In [5]:
re.findall(phone_number_regex, test_string)

['734-764-1817',
 '(734) 764-1817',
 '734 746-1817',
 '123-567-1839',
 '123-567-1839',
 '(202) 388-6389']

In [6]:
for mo in re.finditer(phone_number_regex, test_string):
    print(mo.span(), mo.group())

(13, 25) 734-764-1817
(30, 44) (734) 764-1817
(53, 65) 734 746-1817
(72, 84) 123-567-1839
(88, 100) 123-567-1839
(101, 115) (202) 388-6389


Note that each match object's span gives the start and end positions of the match in the searched string.
Those positions can be used to slice the string, for example:

In [7]:
test_string[13:25]

'734-764-1817'
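
A compiled pattern can also be reused through its own methods rather than being passed to the module-level functions each time. The following is a minimal sketch of that equivalence; the `sample` string and the variable names are made up for illustration.

```python
import re

# Hypothetical sample text, shortened from the test string above
sample = "Call 734-764-1817 or (202) 388-6389."

# Compile once, then reuse the pattern object's own methods
phone = re.compile(r"\(?\d{3}\)?-? ?\d{3}-\d{4}")

first = phone.search(sample)          # same result as re.search(phone, sample)
all_numbers = phone.findall(sample)   # same result as re.findall(phone, sample)

print(first.group())
print(all_numbers)

# .start() and .end() give the same two positions as .span()
print(sample[first.start():first.end()])
```

Either calling style works; the method form simply avoids repeating the pattern as an argument.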

## Using groups

Now, we can add groups, say for the area code and the rest of the number. First, `.compile()` the regex pattern:

In [8]:
phone_number_regex_with_groups = re.compile(r"\(?(\d\d\d)\)?-? ?(\d{3}-\d{4})")

Next, search. This time, other methods are available:

* `.group()` returns a specific group (specify the group by passing its number as the argument; `.group()` or `.group(0)` returns the entire match)
* `.groups()` returns a tuple of all of the groups in the match

When used with two or more groups, `.findall()` returns a list of tuples, so we can reference each match using its index in the list (`mo[i]`) and then access the groups of that match using their index in the tuple (`mo[i][index_in_tuple]`).

In [9]:
mo = re.findall(phone_number_regex_with_groups, test_string)

for i in range(0,len(mo)):
    print(mo[i][0], '|', mo[i][1])

734 | 764-1817
734 | 764-1817
734 | 746-1817
123 | 567-1839
123 | 567-1839
202 | 388-6389
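
How the shape of the `.findall()` result depends on the number of groups can be seen in a small sketch. The `text` string and the three patterns below are illustrative variations on the phone number regex, not part of the original notebook.

```python
import re

# Hypothetical sample text with two phone numbers
text = "734-764-1817 and (202) 388-6389"

# No groups: a list of the matched strings
no_groups = re.findall(r"\d{3}-\d{4}", text)

# One group: a list of strings holding only that group
one_group = re.findall(r"\(?(\d{3})\)?-? ?\d{3}-\d{4}", text)

# Two groups: a list of tuples, one tuple per match
two_groups = re.findall(r"\(?(\d{3})\)?-? ?(\d{3}-\d{4})", text)

print(no_groups)
print(one_group)
print(two_groups)

# Tuples can be unpacked directly in the loop instead of indexing
for area_code, local_number in two_groups:
    print(area_code, '|', local_number)
```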


Or, use `.finditer()`, which we can loop through like a list, and then use the `.group()` method on each match object to reference the groups by their number:

From `phone_number_regex_with_groups`:

| `\(?` | `(\d\d\d)` | `\)?-? ?` | `(\d{3}-\d{4})` |
| --- | --- | --- | --- |
| ungrouped | Group 1 | ungrouped | Group 2 |

In [10]:
for mo in re.finditer(phone_number_regex_with_groups, test_string):
    print(mo)

<re.Match object; span=(13, 25), match='734-764-1817'>
<re.Match object; span=(30, 44), match='(734) 764-1817'>
<re.Match object; span=(53, 65), match='734 746-1817'>
<re.Match object; span=(72, 84), match='123-567-1839'>
<re.Match object; span=(88, 100), match='123-567-1839'>
<re.Match object; span=(101, 115), match='(202) 388-6389'>


In [11]:
for mo in re.finditer(phone_number_regex_with_groups, test_string):
    print('positions:', mo.span(), '\tGroup 1:', mo.group(1), '| Group 2:', mo.group(2))

positions: (13, 25) 	Group 1: 734 | Group 2: 764-1817
positions: (30, 44) 	Group 1: 734 | Group 2: 764-1817
positions: (53, 65) 	Group 1: 734 | Group 2: 746-1817
positions: (72, 84) 	Group 1: 123 | Group 2: 567-1839
positions: (88, 100) 	Group 1: 123 | Group 2: 567-1839
positions: (101, 115) 	Group 1: 202 | Group 2: 388-6389
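
Groups can be used to modify text as well as extract it. As a sketch of one way to do that, the example below uses the pattern's `.sub()` method with backreferences (`\1`, `\2`, `\3`) to rewrite phone numbers in a single hyphenated style. The three-group pattern and the sample text are assumptions made for this illustration.

```python
import re

# Variation on the phone regex with three groups (an assumption for this sketch):
# area code, exchange, and line number
phone_parts = re.compile(r"\(?(\d{3})\)?-? ?(\d{3})-(\d{4})")

text = "Or (734) 764-1817 (phone)\n734 746-1817 (fax)"

# \1, \2 and \3 in the replacement refer to groups 1, 2 and 3 of each match
normalized = phone_parts.sub(r"\1-\2-\3", text)

print(normalized)
```

Text that does not match the pattern, such as `(phone)` and `(fax)`, is left untouched.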


### How about searching for URLs?

In [12]:
url_regex = re.compile(r'https?:\/\/www\..*\.com')

In [13]:
mo = re.search(url_regex, test_string)

print(mo)

<re.Match object; span=(433, 454), match='http://www.google.com'>


In [14]:
mo = re.findall(url_regex, test_string)

print(mo)

['http://www.google.com']


But there are a lot of URLs that aren't matching, including those with subdomains other than `www` (such as `www2`) or no subdomain, and those that are not `.com` sites. (The `s?` already allows `https`, but the only `https` URL in the test data uses `www2`.) Can you develop some refinements that will match those other patterns? Here's one way using groups:

In [15]:
url_regex_with_groups = re.compile(r"(http|https)(:\/\/)(\w*)\.(.*)\.(org|net|com|edu)")

In [17]:
mo = re.findall(url_regex_with_groups, test_string)

print(mo)

[('http', '://', 'www', 'librarycarpentry', 'org'), ('http', '://', 'www', 'library-carpentry', 'net'), ('https', '://', 'www2', 'librarycarpentry', 'com'), ('http', '://', 'subdomain', 'ebook-central', 'edu'), ('http', '://', 'www', 'google', 'com')]


For even more detailed parsing, try nested groups to separate out the top-level domain (`com`, `edu`) from the named domain (`site.com`). Also, we can make the subdomain optional, so it matches domains with no subdomain:

In [18]:
url_regex_with_nested_groups = re.compile(r"(http|https)(:\/\/)((\w*)\.)?((.*)\.(org|net|com|edu))")

Using `.finditer()` and the `.groups()` method, we can see that each match returns a tuple of seven groups, with `None` for any optional group that did not take part in the match:

In [19]:
for url_mo in re.finditer(url_regex_with_nested_groups, test_string):
    print(type(url_mo.groups()),len(url_mo.groups()),url_mo.groups())

<class 'tuple'> 7 ('http', '://', 'www.', 'www', 'librarycarpentry.org', 'librarycarpentry', 'org')
<class 'tuple'> 7 ('http', '://', 'www.', 'www', 'library-carpentry.net', 'library-carpentry', 'net')
<class 'tuple'> 7 ('https', '://', 'www2.', 'www2', 'librarycarpentry.com', 'librarycarpentry', 'com')
<class 'tuple'> 7 ('http', '://', None, None, 'disciplineoforganizing.org', 'disciplineoforganizing', 'org')
<class 'tuple'> 7 ('http', '://', 'subdomain.', 'subdomain', 'ebook-central.edu', 'ebook-central', 'edu')
<class 'tuple'> 7 ('http', '://', None, None, 'library.com', 'library', 'com')
<class 'tuple'> 7 ('http', '://', 'www.', 'www', 'google.com', 'google', 'com')
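
With several nested groups, the group numbers become hard to track. One option, sketched here as an alternative rather than as the notebook's method, is to name the groups with `(?P<name>...)` and use `(?:...)` for grouping that should not be captured. The group names and the stricter `[\w-]+` domain pattern are assumptions made for this example.

```python
import re

# (?P<name>...) names a group; (?:...) groups without capturing
url_named = re.compile(
    r"(?P<scheme>https?)://(?:(?P<subdomain>\w+)\.)?(?P<domain>[\w-]+)\.(?P<tld>org|net|com|edu)"
)

urls = [
    "http://www.librarycarpentry.org",
    "http://library.com",
    "http://subdomain.ebook-central.edu",
]

# .groupdict() returns the named groups as a dictionary
parsed = [url_named.match(url).groupdict() for url in urls]

for parts in parsed:
    print(parts)
```

Named groups can also be read one at a time with `mo.group('domain')`.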


## Match file extensions

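Let's create one that will match different versions of file extensions. What is the basic pattern of a file extension? Generally, a filename ends with three lowercase letters following a dot/period. (Note that `\w` is broader than that: it also matches uppercase letters, digits, and underscores.)

To make sure the three suffix characters are the whole extension, add `\b` at the end of the regex. `\b` asserts a word boundary, so the match fails if another letter or digit follows (as in `.jpeg`). In our data each filename sits at the end of a line, so the word boundary coincides with the end of the line.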

In [20]:
extension_matcher_basic = re.compile(r'.*\.\w\w\w\b')
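
A word boundary is not the same as the end of a line. The sketch below compares `\b` with the `$` anchor, using the `re.MULTILINE` flag so that `$` matches at the end of every line. The sample text and variable names are made up for illustration.

```python
import re

# Hypothetical text: the first filename is followed by more words on the same line
text = "doge.jpg is cute\ndoge.doc\ndoge.html"

word_boundary = re.compile(r'.*\.\w\w\w\b')
end_of_line = re.compile(r'.*\.\w\w\w$', re.MULTILINE)

# \b is satisfied by the space after "doge.jpg", so that name matches mid-line
boundary_matches = word_boundary.findall(text)

# $ requires the extension to be the last thing on its line
line_end_matches = end_of_line.findall(text)

print(boundary_matches)
print(line_end_matches)
```

Neither pattern matches `doge.html`, because a fourth letter follows the first three.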

To "match" the regex, I use a variable `mo` (match object), although you could use anything else that you want.

For a basic search over several lines, use the `.findall()` function on a test string of filenames:

In [21]:
filename_test_string = """doge.jpg
doge.jpeg
doge.JPEG
doge.jpeg2000
doge.doc
doge.docx
doge.htm
doge.html
doge.HTML"""

In [22]:
mo = re.findall(extension_matcher_basic, filename_test_string)

print(mo)

['doge.jpg', 'doge.doc', 'doge.htm']


But as you can see, the three letter extension may be too specific. Here are the filenames we're searching: 

```txt
doge.jpg
doge.jpeg
doge.JPEG
doge.jpeg2000
doge.doc
doge.docx
doge.htm
doge.html
doge.HTML
```

All of these are valid filenames, but clearly not all are matched by our basic filename matcher expression. 

So, let's make it more complex: 

* must match lower and uppercase (case insensitive)
* often 3 or 4 characters
* sometimes more (but maybe we can treat these as edge cases one by one)
* when you look at the above, note that it is a complex pattern but there are a limited number of options

Let's start with the general pattern and more than three letters, but limit to alphanumeric characters:

In [23]:
filename_regex_more_complex = re.compile(r'.*\.[a-zA-Z0-9]{3,8}\b')

In [24]:
mo = re.findall(filename_regex_more_complex, filename_test_string)

print(mo)

['doge.jpg', 'doge.jpeg', 'doge.JPEG', 'doge.jpeg2000', 'doge.doc', 'doge.docx', 'doge.htm', 'doge.html', 'doge.HTML']


Now, the expression matches any alphanumeric string of 3 to 8 characters following a `.` and
ending at a word boundary (here, the end of each line).
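
Another way to handle the case-insensitivity requirement is to list the known extensions and pass the `re.IGNORECASE` flag. This is a sketch of that alternative, not the approach taken above; the list of extensions and the variable names are assumptions for illustration.

```python
import re

filenames = """doge.jpg
doge.jpeg
doge.JPEG
doge.jpeg2000
doge.doc
doge.docx
doge.htm
doge.html
doge.HTML"""

# IGNORECASE makes letters match in either case;
# MULTILINE makes ^ and $ match at the start and end of every line
known_extensions = re.compile(r'^.*\.(jpe?g|docx?|html?)$', re.IGNORECASE | re.MULTILINE)

matched = [mo.group() for mo in known_extensions.finditer(filenames)]
extensions = [mo.group(1).lower() for mo in known_extensions.finditer(filenames)]

print(matched)
print(extensions)
```

With this stricter pattern, `doge.jpeg2000` is left out and would need to be handled as an edge case.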

------
Synthesis: regex + file manipulation

## Joining it together: regex and file renaming

This section requires `re` and `os`. (The `shutil` module is another option for moving and renaming files.)

In [25]:
import os

Create a list of the files we want to parse (in `data/filenames-with-american-dates`):

In [27]:
data_folder = 'data'
files_folder = 'filenames-with-american-dates'

# list the names of the files in data/filenames-with-american-dates
american_date_filenames = os.listdir(os.path.join(os.getcwd(), data_folder, files_folder))

print('Found',len(american_date_filenames),'files:',american_date_filenames)

Found 7 files: ['diary-04-23-20.docx', 'observations-03-30-2018.csv', '08-12-1997-items.xlsx', 'diary-04-23-19.doc', '08-12-1997-items.xls', 'books-on-shelves12-3-2002.txt', 'sightings-202203.jpg']


Now we'll use an additional feature of regex building in Python, the `re.VERBOSE` option. Pass this flag to `re.compile()` to allow whitespace and `#` comments in between elements of the regex, such as groups. In this case, I want to separate out the parts because it makes the groups (a bit) easier to track.

Hint: for more complex regexes like this next one, use a regex tool like [Regex101](https://regex101.com/) or [RegEx Pal](https://www.regexpal.com/) to develop your pattern string first. That is easier than working it out directly in Python.

In [28]:
US_date_pattern = re.compile(r"""^(.*?) # Group 1: all text that might be before the date 
                           ((0|1)?\d)-   #  match a 1- or 2-digit month, assuming separator is a hyphen
                           ((0|1|2|3)?\d)- #  match a 1- or 2-digit day
                           ((19|20)?\d\d) #  match a 2-digit year or 4-digit year, 1900s or 2000s
                           (.*?)$       #  suffix
                           """, re.VERBOSE) # the re.VERBOSE argument allows us to use this sort of extended display, which can be useful for long or complicated expressions

So, let's check it. Questions to consider: 

* is the pattern matching all of the things we want it to? 
* are there any non-"American" dates that it matches? 
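
One way to explore these questions is to run the pattern against a few made-up filenames. In this sketch the pattern is repeated so the block runs on its own, and the two `report-...` names are hypothetical test cases, not files from the data folder.

```python
import re

US_date_pattern = re.compile(r"""^(.*?)  # Group 1: text before the date
                   ((0|1)?\d)-           # Group 2: 1- or 2-digit month
                   ((0|1|2|3)?\d)-       # Group 4: 1- or 2-digit day
                   ((19|20)?\d\d)        # Group 6: 2- or 4-digit year
                   (.*?)$                # Group 8: suffix
                   """, re.VERBOSE)

candidates = [
    "report-23-04-2020.txt",   # hypothetical day-month-year name
    "report-2020-04-23.txt",   # hypothetical ISO 8601 name
    "sightings-202203.jpg",    # from the file list: no hyphens inside the date
]

parsed = {}
for name in candidates:
    mo = US_date_pattern.match(name)
    # store (month, day, year) as matched, or None if there is no match
    parsed[name] = None if mo is None else (mo.group(2), mo.group(4), mo.group(6))
    print(name, '->', parsed[name])
```

Both hypothetical names match, but with the wrong reading: the day-first name is parsed as month `3` (the leading `2` ends up in the prefix), and the ISO name as month `0`. A match alone therefore does not guarantee a valid American-style date.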

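Okay, let's try it out and see if we can assign the groups to variables, so we can use them later on. Groups are numbered by their opening parenthesis, so the nested groups (3, 5, and 7) are skipped here, and the month, day, year, and suffix are groups 2, 4, 6, and 8: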

In [29]:
for file in american_date_filenames:
    mo = US_date_pattern.match(file)
    
    # continue to the next if it doesn't match
    if mo is None:
        print("no match:",file)
        continue
    
    prefix    = mo.group(1)
    monthPart = mo.group(2)
    dayPart   = mo.group(4)
    yearPart  = mo.group(6)
    suffix    = mo.group(8)
    
    print("Retrieved:", file, "\nprefix:", prefix, "\nmonth:", monthPart, "\nday:", dayPart, "\nyear:", yearPart, "\nsuffix:", suffix, "\n")

Retrieved: diary-04-23-20.docx 
prefix: diary- 
month: 04 
day: 23 
year: 20 
suffix: .docx 

Retrieved: observations-03-30-2018.csv 
prefix: observations- 
month: 03 
day: 30 
year: 2018 
suffix: .csv 

Retrieved: 08-12-1997-items.xlsx 
prefix:  
month: 08 
day: 12 
year: 1997 
suffix: -items.xlsx 

Retrieved: diary-04-23-19.doc 
prefix: diary- 
month: 04 
day: 23 
year: 19 
suffix: .doc 

Retrieved: 08-12-1997-items.xls 
prefix:  
month: 08 
day: 12 
year: 1997 
suffix: -items.xls 

Retrieved: books-on-shelves12-3-2002.txt 
prefix: books-on-shelves 
month: 12 
day: 3 
year: 2002 
suffix: .txt 

no match: sightings-202203.jpg


TODO: now that our match is working, use the `os` tools to rename the files with datestrings consistent with ISO 8601. That is, YYYY-MM-DD: 
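
One possible way to approach the renaming step is sketched below. The function names `iso_filename` and `rename_to_iso` and the `dry_run` parameter are hypothetical. The sketch also assumes that two-digit years belong to the 2000s, which may not hold for your files. It uses `datetime.date` to reject impossible dates and `os.rename()` to do the renaming.

```python
import datetime
import os
import re

US_date_pattern = re.compile(r"""^(.*?)  # Group 1: text before the date
                   ((0|1)?\d)-           # Group 2: 1- or 2-digit month
                   ((0|1|2|3)?\d)-       # Group 4: 1- or 2-digit day
                   ((19|20)?\d\d)        # Group 6: 2- or 4-digit year
                   (.*?)$                # Group 8: suffix
                   """, re.VERBOSE)


def iso_filename(filename):
    """Return filename with its date rewritten as YYYY-MM-DD, or None."""
    mo = US_date_pattern.match(filename)
    if mo is None:
        return None
    month, day, year = int(mo.group(2)), int(mo.group(4)), int(mo.group(6))
    if year < 100:
        year += 2000  # ASSUMPTION: two-digit years are in the 2000s
    try:
        iso_date = datetime.date(year, month, day).isoformat()
    except ValueError:
        return None  # not a real calendar date, e.g. month 0 or February 30
    return mo.group(1) + iso_date + mo.group(8)


def rename_to_iso(folder, dry_run=True):
    """Rename matching files in folder; with dry_run=True, only print the plan."""
    for name in sorted(os.listdir(folder)):
        new_name = iso_filename(name)
        if new_name is None:
            continue
        target = os.path.join(folder, new_name)
        if os.path.exists(target):
            print("skipping, target exists:", new_name)
            continue
        print(name, "->", new_name)
        if not dry_run:
            os.rename(os.path.join(folder, name), target)


# Preview the new names without touching any files
for name in ['diary-04-23-20.docx', '08-12-1997-items.xlsx', 'sightings-202203.jpg']:
    print(name, '->', iso_filename(name))
```

Calling `rename_to_iso(folder)` with the default `dry_run=True` only prints the planned changes, which makes it easy to check the results before passing `dry_run=False`.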

## Practice Questions

These are inspired in part by Al Sweigart, _Automate the Boring Stuff with Python_ (2015 version).

1. What function do you use in Python to create a regex pattern?
1. What does the re `.search()` method return?
1. How can you display or return the strings that are matched? (The use of `group` can be a bit unexpected in the implementation of this Python library.)
1. Can you write a regex that matches a number with commas for every three digits? It would match:
  * 42
  * 1,234
  * 6,368,745
  * but not match:
  * 12,24,567 (only two digits between the commas)
  * 1234 (no commas)
1. The `findall()` method can return a list of strings or a list of tuples. Why would it return one or the other? 