
# Python Regular Expressions

Regular expressions are a powerful language for matching text patterns. This notebook gives a basic introduction to regular expressions and shows how regular expressions work in Python. 

The Python "re" module provides regular expression support.


In Python a regular expression search is typically written as:

 `match = re.search(pat, str)`
 

* The `re.search()` method takes a regular expression pattern and a string and searches for that pattern within the string. 
* If the search is successful, search() returns a match object; otherwise it returns None. 

Therefore, the search is usually immediately followed by an if-statement to test if the search succeeded, as shown in the following example which searches for the pattern 'word:' followed by a 3 letter word (details below):

In [1]:
import re # importing re module
str = 'an example word:cat!!'
match = re.search(r'word:\w\w\w', str)
# If-statement after search() tests if it succeeded
if match:
  print ('found', match.group()) ## 'found word:cat'
else:
  print ('did not find')

found word:cat


The code match = re.search(pat, str) stores the search result in a variable named "match". Then the if-statement tests the match -- if true the search succeeded and match.group() is the matching text (e.g. 'word:cat'). Otherwise if the match is false (None to be more specific), then the search did not succeed, and there is no matching text.


The 'r' at the start of the pattern string designates a Python "raw" string, which passes backslashes through unchanged. This is very handy for regular expressions.
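
To see what the raw-string prefix changes, here is a small sketch comparing a normal and a raw string literal. The variable names are chosen only for illustration.

```python
import re

# In a normal string literal, '\n' is one character (a newline).
# In a raw string literal, r'\n' is two characters: a backslash and 'n'.
normal_len = len('\n')
raw_len = len(r'\n')
print(normal_len, raw_len)  # 1 2

# Without the r prefix, every backslash meant for the regex must be doubled.
doubled = 'word:\\w\\w\\w'
raw = r'word:\w\w\w'
print(doubled == raw)  # True

match = re.search(raw, 'an example word:cat!!')
print(match.group())  # word:cat
```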

## Basic Examples

The basic rules of regular expression search for a pattern within a string are:

* The search proceeds through the string from start to end, stopping at the first match found
* All of the pattern must be matched, but not all of the string
* If `match = re.search(pat, str)` is successful, match is not None and in particular match.group() is the matching text





In [0]:
## Search for pattern 'iii' in string 'piiig'.
## All of the pattern must match, but it may appear anywhere.
## On success, match.group() is matched text.
match = re.search(r'iii', 'piiig')  ## found, match.group() == "iii"
match = re.search(r'igs', 'piiig')  ## not found, match == None

## . = any char but \n
match = re.search(r'..g', 'piiig')  ## found, match.group() == "iig"

## \d = digit char, \w = word char
match = re.search(r'\d\d\d', 'p123g')  ## found, match.group() == "123"
match = re.search(r'\w\w\w', '@@abcd!!')  ## found, match.group() == "abc"

## Repetition

Things get more interesting when you use + and * to specify repetition in the pattern

* \+ -- 1 or more occurrences of the pattern to its left, e.g. 'i+' = one or more i's
* \* -- 0 or more occurrences of the pattern to its left
* ? -- match 0 or 1 occurrences of the pattern to its left
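
As a quick illustration of `?`, the sketch below makes one letter optional. The sample words and the `results` dict are made up for this example.

```python
import re

# 'u?' means zero or one 'u', so both spellings match,
# but a word with two u's in a row does not.
results = {}
for text in ['color', 'colour', 'colouur']:
    match = re.search(r'colou?r', text)
    results[text] = match.group() if match else None
    print(text, '->', results[text])
```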

### Leftmost & Largest
First the search finds the leftmost match for the pattern, and second it tries to use up as much of the string as possible -- i.e. + and * go as far as possible (the + and * are said to be "greedy").




In [0]:
## i+ = one or more i's, as many as possible.
match = re.search(r'pi+', 'piiig')  ## found, match.group() == "piii"

## Finds the first/leftmost solution, and within it drives the +
## as far as possible (aka 'leftmost and largest').
## In this example, note that it does not get to the second set of i's.
match = re.search(r'i+', 'piigiiii')  ## found, match.group() == "ii"

## \s* = zero or more whitespace chars
## Here look for 3 digits, possibly separated by whitespace.
match = re.search(r'\d\s*\d\s*\d', 'xx1 2   3xx')  ## found, match.group() == "1 2   3"
match = re.search(r'\d\s*\d\s*\d', 'xx12  3xx')  ## found, match.group() == "12  3"
match = re.search(r'\d\s*\d\s*\d', 'xx123xx')  ## found, match.group() == "123"

## ^ = matches the start of string, so this fails:
match = re.search(r'^b\w+', 'foobar')  ## not found, match == None
## but without the ^ it succeeds:
match = re.search(r'b\w+', 'foobar')  ## found, match.group() == "bar"


## Emails Example

Suppose you want to find the email address inside the string 'purple alice-b@google.com monkey dishwasher'. We'll use this as a running example to demonstrate more regular expression features. Here's an attempt using the pattern r'\w+@\w+':

In [3]:
str = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'\w+@\w+', str)
if match:
  print (match.group())  ## 'b@google'

b@google


The search does not get the whole email address in this case because the \w does not match the '-' or '.' in the address. We'll fix this using the regular expression features below.

## Square Brackets

Square brackets can be used to indicate a set of chars, so [abc] matches 'a' or 'b' or 'c'. The codes \w, \s etc. work inside square brackets too with the one exception that dot (.) just means a literal dot. For the emails problem, the square brackets are an easy way to add '.' and '-' to the set of chars which can appear around the @ with the pattern r'[\w.-]+@[\w.-]+' to get the whole email address:

In [4]:
match = re.search(r'[\w.-]+@[\w.-]+', str)
if match:
  print (match.group())  ## 'alice-b@google.com'

alice-b@google.com


(More square-bracket features) You can also use a dash to indicate a range, so [a-z] matches all lowercase letters. To use a dash without indicating a range, put the dash last, e.g. [abc-]. An up-hat (^) at the start of a square-bracket set inverts it, so [^ab] means any char except 'a' or 'b'.
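
The three square-bracket features just described can be tried out with a short sketch like this one (the sample strings are invented for illustration):

```python
import re

# A dash between two chars is a range: [a-z] is any lowercase letter.
lower_runs = re.findall(r'[a-z]+', 'Hello World 42')
print(lower_runs)  # ['ello', 'orld']

# ^ at the start of the set inverts it: any char except 'a' or 'b'.
not_ab = re.findall(r'[^ab]+', 'abcabd')
print(not_ab)  # ['c', 'd']

# A dash placed last is a literal dash, not a range.
with_dash = re.findall(r'[abc-]+', 'a-b c-- xyz')
print(with_dash)  # ['a-b', 'c--']
```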

## Group Extraction

The "group" feature of a regular expression allows you to pick out parts of the matching text. Suppose for the emails problem that we want to extract the username and host separately. 

To do this, add parentheses ( ) around the username and host in the pattern, like this: `r'([\w.-]+)@([\w.-]+)'`. 

In this case, the parentheses do not change what the pattern will match; instead they establish logical "groups" inside the match text. On a successful search, match.group(1) is the match text corresponding to the 1st left parenthesis, and match.group(2) is the text corresponding to the 2nd left parenthesis. The plain match.group() is still the whole match text as usual.

In [5]:
str = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'([\w.-]+)@([\w.-]+)', str)
if match:
  print (match.group())   ## 'alice-b@google.com' (the whole match)
  print (match.group(1))  ## 'alice-b' (the username, group 1)
  print (match.group(2))  ## 'google.com' (the host, group 2)

alice-b@google.com
alice-b
google.com


A common workflow with regular expressions is that you write a pattern for the thing you are looking for, adding parenthesis groups to extract the parts you want.

## findall

`findall()` is probably the single most powerful function in the re module. Above we used `re.search()` to find the first match for a pattern. findall() finds *all* the matches and returns them as a list of strings, with each string representing one match.

In [6]:
## Suppose we have a text with many email addresses
str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

## Here re.findall() returns a list of all the found email strings
emails = re.findall(r'[\w\.-]+@[\w\.-]+', str) ## ['alice@google.com', 'bob@abc.com']
for email in emails:
  # do something with each found email string
  print (email)

alice@google.com
bob@abc.com


## findall With Files

For files, you may be in the habit of writing a loop to iterate over the lines of the file, and you could then call findall() on each line. Instead, let findall() do the iteration for you -- much better! Just feed the whole file text into findall() and let it return a list of all the matches in a single step (recall that f.read() returns the whole text of a file in a single string):

In [0]:
# Open the file; the with-statement closes it automatically when the block ends
with open('test.txt', 'r') as f:
  # Feed the file text into findall(); it returns a list of all the found strings
  strings = re.findall(r'some pattern', f.read())

## findall and Groups

The parenthesis ( ) group mechanism can be combined with findall(). 
* If the pattern includes 2 or more parenthesis groups, then instead of returning a list of strings, findall() returns a list of *tuples*. 
* Each tuple represents one match of the pattern, and inside the tuple is the group(1), group(2) .. data. 
* So if 2 parenthesis groups are added to the email pattern, then findall() returns a list of tuples, each length 2 containing the username and host, e.g. ('alice', 'google.com').
* Once you have the list of tuples, you can loop over it to do some computation for each tuple
* If the pattern includes no parentheses, then findall() returns a list of found strings as in earlier examples
* If the pattern includes a single set of parentheses, then findall() returns a list of strings corresponding to that single group



In [8]:
str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
tuples = re.findall(r'([\w\.-]+)@([\w\.-]+)', str)
print (tuples)  ## [('alice', 'google.com'), ('bob', 'abc.com')]
for tuple in tuples:
  print (tuple[0])  ## username
  print (tuple[1])  ## host

[('alice', 'google.com'), ('bob', 'abc.com')]
alice
google.com
bob
abc.com
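
The single-group case from the list above is not covered by the example, so here is a minimal sketch of it. It reuses the same sample text; the name `hosts` is only illustrative.

```python
import re

text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

# Only the host part is wrapped in parentheses, so findall() returns
# a list of strings for that one group rather than a list of tuples.
hosts = re.findall(r'[\w.-]+@([\w.-]+)', text)
print(hosts)  # ['google.com', 'abc.com']
```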


## split with regex

The string method split() is the right tool in many cases. But what if you want, for example, the bare words of a text, without any special characters or whitespace? For that we have to use the split() function from the re module. We illustrate this with a short text from the beginning of Metamorphoses by Ovid:

In [11]:
metamorphoses = "OF bodies chang'd to various forms"
re.split(r"\W+", metamorphoses) ## ['OF', 'bodies', 'chang', 'd', 'to', 'various', 'forms']

['OF', 'bodies', 'chang', 'd', 'to', 'various', 'forms']

The following example is a good case where a regular expression is clearly superior to the string split. Let's assume that we have data lines with the surnames, first names and professions of people. We want to strip the redundant labels from each line, i.e. "surname:", "prename:" and so on, so that we have solely the surname in the first column, the first name in the second column and the profession in the third column: 

In [17]:
lines = ["surname:Obama, prename:Barack, profession:president", "surname:Merkel, prename:Angela, profession:chancellor"]
for line in lines:
  print(re.split(r",* *\w*:", line))  ## ['', 'Obama', 'Barack', 'president'] ['', 'Merkel', 'Angela', 'chancellor']

['', 'Obama', 'Barack', 'president']
['', 'Merkel', 'Angela', 'chancellor']


We can easily improve the script by using a slice operator, so that we don't have the empty string as the first element of our result lists: 

In [18]:
lines = ["surname:Obama, prename:Barack, profession:president", "surname:Merkel, prename:Angela, profession:chancellor"]
for line in lines:
  print(re.split(r",* *\w*:", line)[1:])  ## ['Obama', 'Barack', 'president'] ['Merkel', 'Angela', 'chancellor']

['Obama', 'Barack', 'president']
['Merkel', 'Angela', 'chancellor']


## Substitution

The re.sub(pat, replacement, str) function searches for all the instances of pattern in the given string, and replaces them. The replacement string can include '\1', '\2' which refer to the text from group(1), group(2), and so on from the original matching text.

Here's an example which searches for all the email addresses, and changes them to keep the user (\1) but have yo-yo-dyne.com as the host.

In [9]:
str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
## re.sub(pat, replacement, str) -- returns new string with all replacements,
## \1 is group(1), \2 group(2) in the replacement
print (re.sub(r'([\w\.-]+)@([\w\.-]+)', r'\1@yo-yo-dyne.com', str))
## purple alice@yo-yo-dyne.com, blah monkey bob@yo-yo-dyne.com blah dishwasher

purple alice@yo-yo-dyne.com, blah monkey bob@yo-yo-dyne.com blah dishwasher


## Options

The re functions take options to modify the behavior of the pattern match. The option flag is added as an extra argument to the search() or findall() etc., e.g. `re.search(pat, str, re.IGNORECASE)`.

* IGNORECASE -- ignore upper/lowercase differences for matching, so 'a' matches both 'a' and 'A'.
* DOTALL -- allow dot (.) to match newline -- normally it matches anything but newline. This can trip you up -- you think `.*` matches everything, but by default it does not go past the end of a line. Note that \s (whitespace) includes newlines, so if you want to match a run of whitespace that may include a newline, you can just use `\s*`
* MULTILINE -- Within a string made of many lines, allow `^` and `$` to match the start and end of each line. Normally `^` and `$` would just match the start and end of the whole string.
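
The effect of each flag can be checked with a short sketch like the one below. The three-line sample text and the variable names are invented for illustration.

```python
import re

text = 'First line\nsecond LINE\nthird line'

# IGNORECASE: 'line' also matches 'LINE'.
any_case = re.findall(r'line', text, re.IGNORECASE)
print(any_case)  # ['line', 'LINE', 'line']

# Without DOTALL, .* stops at the end of the first line.
one_line = re.search(r'First.*', text).group()
print(repr(one_line))  # 'First line'

# With DOTALL, dot also matches newline, so .* runs to the end of the string.
all_lines = re.search(r'First.*', text, re.DOTALL).group()
print(all_lines == text)  # True

# Without MULTILINE, ^ matches only at the start of the whole string.
first_only = re.findall(r'^\w+', text)
print(first_only)  # ['First']

# With MULTILINE, ^ matches at the start of every line.
each_line = re.findall(r'^\w+', text, re.MULTILINE)
print(each_line)  # ['First', 'second', 'third']
```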


---



# Assignment 1 - Regex 

* The files for this assignment are in the "babynames" directory [here](https://github.com/screel-labs/ml-lite/tree/master/Python-Tutorials/exercises).
* Edit the code in babynames.py. 
* The files baby1990.html baby1992.html ... contain raw html, similar to what you get from the Social Security Administration's baby-names site. Take a look at the html and think about how you might scrape the data out of it.


### Part A

* In the babynames.py file, implement the extract_names(filename) function which takes the filename of a baby1990.html file and returns the data from the file as a single list -- the year string at the start of the list followed by the name-rank strings in alphabetical order, e.g. ['2006', 'Aaliyah 91', 'Aaron 57', 'Abagail 895', ...]. 

* Modify main() so it calls your extract_names() function and prints what it returns (main already has the code for the command line argument parsing).

* Rather than treat the boy and girl names separately, we'll just lump them all together. In some years, a name appears more than once in the html, but we'll just use one number per name. (*Optional: make the algorithm smart about this case and choose whichever number is smaller.*)


Here are some suggested milestones:

* Extract all the text from the file and print it
* Find and extract the year and print it
* Extract the names and rank numbers and print them
* Get the names data into a dict and print it
* Build the [year, 'name rank', ... ] list and print it
* Fix main() to use the list returned by extract_names()
* The summary text should look like this for each file:

> 2006

> Aaliyah 91

> Aaron 57

> Abagail 895

> Abbey 695

> Abbie 650

> ...
---

### Part B

Suppose instead of printing the text to standard out, we want to write files containing the text. If the flag --summaryfile is present, do the following: for each input file 'foo.html', instead of printing to standard output, write a new file 'foo.html.summary' that contains the summary text for that file.

Once the --summaryfile feature is working, run the program on all the files using \* like this: "./babynames.py --summaryfile baby\*.html". This generates all the summaries in one step. (The standard behavior of the shell is that it expands the "baby\*.html" pattern into the list of matching filenames, and then the shell runs babynames.py, passing in all those filenames in the sys.argv list.)



---



**A Few Tips:** 
* Build the program as a series of small milestones, getting each step to run/print something before trying the next step. This is the pattern used by experienced programmers -- build a series of incremental milestones, each with some output to check, rather than building the whole program in one huge step.

* Printing the data you have at the end of one milestone helps you think about how to re-structure that data for the next milestone. Python is well suited to this style of incremental development. For example, first get it to the point where it extracts and prints the year and calls sys.exit(0). 


---



# Python Utilities

In this section, we look at a few of Python's many standard utility modules to solve common problems.

## File System -- os, os.path, shutil

The *os* and *os.path* modules include many functions to interact with the file system. The *shutil* module can copy files.

* [os module docs](https://docs.python.org/3/library/os.html)
* filenames = os.listdir(dir) -- list of filenames in that directory path (not including . and ..). The filenames are just the names in the directory, not their absolute paths.
* os.path.join(dir, filename) -- given a filename from the above list, use this to put the dir and filename together to make a path
* os.path.abspath(path) -- given a path, return an absolute form, e.g. /home/nick/foo/bar.html
* os.path.dirname(path), os.path.basename(path) -- given dir/foo/bar.html, return the dirname "dir/foo" and basename "bar.html"
* os.path.exists(path) -- true if it exists
* os.mkdir(dir_path) -- makes one dir, os.makedirs(dir_path) makes all the needed dirs in this path
* os.getcwd() -- gets your current working directory
* os.rmdir(dir_path) -- removes a directory
* shutil.copy(source-path, dest-path) -- copy a file (dest path directories should exist)

In [0]:
import os
## Example pulls filenames from a dir, prints their relative and absolute paths
def printdir(dir):
  filenames = os.listdir(dir)
  for filename in filenames:
    print (filename)  ## foo.txt
    print (os.path.join(dir, filename)) ## dir/foo.txt (relative to current dir)
    print (os.path.abspath(os.path.join(dir, filename))) ## /home/nick/dir/foo.txt
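
The remaining functions from the list can be tried in the same way. This is a minimal sketch that works inside a temporary directory so it does not touch real files; the directory names 'a' and 'b' are made up.

```python
import os
import tempfile

# Split a path into its directory part and its final component.
path = 'dir/foo/bar.html'
dirname = os.path.dirname(path)
basename = os.path.basename(path)
print(dirname)   # dir/foo
print(basename)  # bar.html

with tempfile.TemporaryDirectory() as base:
    nested = os.path.join(base, 'a', 'b')
    os.makedirs(nested)  # creates both 'a' and 'a/b'
    existed_after_create = os.path.exists(nested)
    os.rmdir(nested)     # removes only the (empty) directory 'b'
    existed_after_remove = os.path.exists(nested)

print(existed_after_create, existed_after_remove)  # True False
```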

With the [shutil module](https://docs.python.org/3/library/shutil.html), you can automate copying files and folders. It provides high-level utility functions for copying, moving and removing files and folders, so you don't have to open, read, write and close files yourself when no processing of the contents is needed.

The copy() function works like the "cp" command in Unix: if the target is a folder, it creates a new file inside it with the same name (basename) as the source file. It also copies the source file's permission bits to the target after copying its content. It raises SameFileError if the source and destination are the same file.

`copy(source_file, [destination_file or dest_dir])`

In [22]:
# copy file example using shutil.copy()
import os
import shutil
import sys  # needed for sys.exc_info() below

source = 'current/test/test.py' ## relative path to source file
target = '/prod/new'  ## path to target directory

assert not os.path.isabs(source)
target = os.path.join(target, os.path.dirname(source))

# create the target folders if they do not already exist
os.makedirs(target, exist_ok=True)

# adding exception handling
try:
    shutil.copy(source, target)
except IOError as e:
    print("Unable to copy file. %s" % e)
except Exception:
    print("Unexpected error:", sys.exc_info())

Unable to copy file. [Errno 2] No such file or directory: 'current/test/test.py'
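
The example above fails on purpose because the source file does not exist. For a run that succeeds, here is a self-contained sketch that creates its own files in a temporary directory; the file and folder names are hypothetical.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as base:
    # Create a small source file to copy.
    source = os.path.join(base, 'notes.txt')
    with open(source, 'w') as f:
        f.write('hello\n')

    # Copying into a directory keeps the source file's basename.
    dest_dir = os.path.join(base, 'backup')
    os.makedirs(dest_dir)
    copied = shutil.copy(source, dest_dir)
    copied_name = os.path.basename(copied)
    print(copied_name)  # notes.txt

    # Copying a file onto itself raises SameFileError.
    same_file_error = False
    try:
        shutil.copy(source, source)
    except shutil.SameFileError:
        same_file_error = True
    print(same_file_error)  # True
```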


## Running External Processes -- subprocess

The [subprocess module](https://docs.python.org/3.7/library/subprocess.html#module-subprocess) enables you to start new applications from your Python program. How cool is that?

You can start a process in Python by creating a Popen object. The program below starts the Unix program 'cat'. The first item in the list is the program name and the remaining items are its arguments, so this is equivalent to running 'cat test.py'. You can start any program with any arguments this way.


In [0]:
from subprocess import Popen, PIPE
 
process = Popen(['cat', 'test.py'], stdout=PIPE, stderr=PIPE, universal_newlines=True)
stdout, stderr = process.communicate()
print (stdout) ## prints the contents of the file 'test.py'

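The `process.communicate()` call waits for the process to finish and returns everything it wrote to standard output and standard error. `stdout` is the process output. `stderr` holds whatever the process wrote to standard error, which is usually empty unless an error occurs. If you only want to wait for the program to finish, you can call `Popen.wait()`.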
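
To see both streams and the exit code together, here is a small sketch. As an assumption made for portability, it runs the Python interpreter itself (`sys.executable`) as the child process instead of 'cat'; the child script and the exit code 3 are made up.

```python
import sys
from subprocess import Popen, PIPE

# Child script: writes one line to each stream, then exits with code 3.
child_code = 'import sys; print("out"); print("err", file=sys.stderr); sys.exit(3)'

process = Popen([sys.executable, '-c', child_code],
                stdout=PIPE, stderr=PIPE, universal_newlines=True)
out, err = process.communicate()  # waits for the child to finish

print(repr(out))           # 'out\n'
print(repr(err))           # 'err\n'
print(process.returncode)  # 3
```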

### Subprocess call():
The subprocess module has a function call() which can be used to start a program. Its first parameter is a list whose first item must be the program name. Its definition (abbreviated) is:

`subprocess.call(args, *, stdin=None, stdout=None, stderr=None, shell=False)`

* Runs the command described by args. 
* Waits for the command to complete, then returns the returncode attribute.

In the example below the full command would be `ls -l`.

In [25]:
import subprocess
subprocess.call(["ls", "-l"])  ## prints the files in current working directory

0

### Save process output (stdout)
We can get the output of a program and store it in a variable directly using check_output. The function is defined as:

`subprocess.check_output(args, *, stdin=None, stderr=None, shell=False, universal_newlines=False)`
 
 * Runs the command with arguments and returns its output as a byte string.


In [27]:
import subprocess
 
s = subprocess.check_output(["echo", "Hello World!"])
print(s)  ## b'Hello World!\n'

b'Hello World!\n'
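
If you would rather get text than bytes, the `universal_newlines=True` parameter from the definition above makes check_output return a str. The sketch below also shows that check_output raises CalledProcessError when the command exits with a non-zero code. As an assumption for portability, it runs the Python interpreter (`sys.executable`) as the child instead of 'echo'.

```python
import subprocess
import sys

# universal_newlines=True decodes the output to a str.
text = subprocess.check_output(
    [sys.executable, '-c', 'print("Hello World!")'],
    universal_newlines=True)
print(repr(text))  # 'Hello World!\n'

# A non-zero exit code raises CalledProcessError.
failed_code = None
try:
    subprocess.check_output([sys.executable, '-c', 'import sys; sys.exit(2)'])
except subprocess.CalledProcessError as e:
    failed_code = e.returncode
print(failed_code)  # 2
```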


# Exceptions

An exception represents a run-time error that halts the normal execution at a particular line and transfers control to error handling code. This section just introduces the most basic uses of exceptions. For example, a run-time error might be that a variable used in the program does not have a value (NameError .. you've probably seen that one a few times), or a file open operation fails because the file does not exist (IOError, which in Python 3 is an alias of OSError). Learn more in the [exceptions tutorial](https://docs.python.org/3/tutorial/errors.html) and see the [entire exception list](https://docs.python.org/3/library/exceptions.html).

Without any error handling code (as we have done thus far), a run-time exception just halts the program with an error message. That's a good default behavior, and you've seen it many times. 

You can add a "try/except" structure to your code to handle exceptions, like this:

* The try: section includes the code which might throw an exception. 
* The except: section holds the code to run if there is an exception. 
* If there is no exception, the except: section is skipped (that is, that code is for error handling only, not the "normal" case for the code). 
* You can get a pointer to the exception object itself with syntax "except IOError as e: .." (e points to the exception object).



In [33]:
import sys
filename = 'hello.py'
try:
  ## Either of these two lines could throw an IOError, say
  ## if the file does not exist or the read() encounters a low level error.
  f = open(filename, 'r')
  text = f.read()
  f.close()
except IOError:
  ## Control jumps directly to here if any of the above lines throws IOError.
  sys.stderr.write('problem reading:' + filename)
finally:
  ## The finally block always runs, whether or not an exception was thrown
  print("hello from finally block!!")
## In any case, the code then continues with the line after the try/except/finally

problem reading:hello.py
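
The "except IOError as e" form mentioned in the list above gives you the exception object to inspect. This is a minimal sketch; the file name is made up and placed inside a fresh temporary directory so that it is certain not to exist.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base:
    missing = os.path.join(base, 'no_such_file.txt')
    try:
        with open(missing, 'r') as f:
            text = f.read()
    except IOError as e:
        ## e points to the exception object. In Python 3 a missing file
        ## raises FileNotFoundError, a subclass of OSError (alias IOError).
        error_name = type(e).__name__
        error_path = e.filename
        print(error_name)             # FileNotFoundError
        print(error_path == missing)  # True
```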

# Assignment 2 - File System and External Commands:
To practice the file system and external-commands concepts, see the Copy Special exercise.

This exercise is in the "copyspecial" directory [here](https://github.com/screel-labs/ml-lite/tree/master/Python-Tutorials/exercises). Edit the code in copyspecial.py.

The copyspecial.py program takes one or more directories as its arguments. We'll say that a "special" file is one where the name contains the pattern \__w__ somewhere, where the w is one or more word chars. The provided main() includes code to parse the command line arguments, but the rest is up to you. Write functions to implement the features below and modify main() to call your functions.

Suggested functions for your solution:

* get_special_paths(dir) -- returns a list of the absolute paths of the special files in the given directory
* copy_to(paths, dir) -- given a list of paths, copies those files into the given directory
* zip_to(paths, zippath) -- given a list of paths, zips those files up into the given zipfile

### Part A (manipulating file paths)

Gather a list of the absolute paths of the special files in all the directories. In the simplest case, just print that list (here the "." after the command is a single argument indicating the current directory). Print one absolute path per line.

We'll assume that names are not repeated across the directories (*optional: check that assumption and error out if it's violated*).

**From terminal, run:**

> `>> ./copyspecial.py .`

**Expected output:**

> /Users/nparlante/pycourse/day2/xyz\__hello__.txt

> /Users/nparlante/pycourse/day2/zz\__something__.jpg

### Part B (file copying)

If the "--todir dir" option is present at the start of the command line, do not print anything and instead copy the files to the given directory, creating it if necessary. Use the python module "shutil" for file copying.

**From terminal, run:**

> `>> ./copyspecial.py --todir /tmp/fooby .`

> `>> ls /tmp/fooby`

**Expected output:**

> xyz\__hello__.txt          

> zz\__something__.jpg

### Part C (calling an external program)

* If the `"--tozip zipfile"` option is present at the start of the command line, run this command: `zip -j zipfile <list all the files>`. This will create a zipfile containing the files. Just for fun/reassurance, also print the command line you are going to do first.
 
 **From terminal, run:**
 
> `>> ./copyspecial.py --tozip tmp.zip .`
 
 **Expected output:**
 
> Command I'm going to do:zip -j tmp.zip /Users/nparlante/pycourse/day2/xyz\__hello\__.txt  /Users/nparlante/pycourse/day2/zz\__something__.jpg


* If the child process exits with an error code, exit with an error code and print the command's output. Test this by trying to write a zip file to a directory that does not exist.

 **From terminal, run:**
 
> `>> ./copyspecial.py --tozip /no/way.zip .`
 
 **Expected output:**
 
> Command I'm going to do:zip -j /no/way.zip /Users/nparlante/pycourse/day2/xyz\__hello\__.txt /Users/nparlante/pycourse/day2/zz\__something\__.jpg

> zip I/O error: No such file or directory 

> zip error: Could not create output file (/no/way.zip)

*Windows note: Windows does not come with a program to produce standard .zip archives by default, but you can download the free and open zip program from [www.info-zip.org](www.info-zip.org).*