In this section, we give an ultra-brief introduction to regular expressions. More advanced topics and comprehensive notes can be found in the book 'Automate the Boring Stuff with Python', available online at https://automatetheboringstuff.com/.

Regular expressions are text-matching patterns described with a formal syntax. You'll often hear them referred to as 'regex' or 'regexp' in conversation. They can express a variety of rules, from matching literal text to finding repetition, and much more. As you advance in Python you'll see that many of your parsing problems can be solved using regular expressions. They are also widely used in the UNIX world.

If you're familiar with Perl, you'll notice that the syntax for regular expressions is very similar in Python. We will be using Python's 're' module for this lecture.

One of the most common uses for the re module is for finding patterns in text. Let's do a quick example of using the search method in the 're' module to find some text:

In [6]:
import re
patterns = ['term1', 'term2'] # list of patterns to search for
text = 'This is a string with term1, but it does not have the other term.'
for pattern in patterns:
    print('Searching for "%s" in: \n"%s"' % (pattern, text))
    if re.search(pattern, text):
        print('Match was found. \n')
    else:
        print('No Match was found.\n')
        
type(re.search('term1', text))

Searching for "term1" in: 
"This is a string with term1, but it does not have the other term."
Match was found. 

Searching for "term2" in: 
"This is a string with term1, but it does not have the other term."
No Match was found.



_sre.SRE_Match

The search() method returns a 'Match' object when the pattern is found, and None otherwise. The 'Match' object is more than just a Boolean: it contains information about the match, including the original input string, the regular expression that was used, and the location of the match. Let's see the methods we can use on the 'Match' object:

In [2]:
pattern2 = 'term1'
text2 = 'This is a string with term1, but it does not have the other term.'
match2 = re.search(pattern2,  text2)
print(match2.start())
print(match2.end())

22
27
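
Besides start() and end(), the 'Match' object exposes the other pieces of information mentioned above. The cell below is a minimal sketch that reuses the same text; the variable name match3 is only illustrative:

```python
import re

text2 = 'This is a string with term1, but it does not have the other term.'
match3 = re.search('term1', text2)

print(match3.group())      # the text that matched: 'term1'
print(match3.span())       # (start, end) as one tuple: (22, 27)
print(match3.string)       # the original input string
print(match3.re.pattern)   # the pattern that was used: 'term1'

# Slicing the input with start() and end() recovers the matched text
print(text2[match3.start():match3.end()])
```
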


Of course, if you press the Tab key as usual, the Jupyter notebook will give you a list of methods associated with the 'Match' object. In general, Python offers two different primitive operations based on regular expressions: match() checks for a match only at the beginning of the string, while search() checks for a match anywhere in the string (this is what Perl does by default).
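
To make the difference between the two operations concrete, here is a small sketch on the same text as before. The names at_start and anywhere are made up for this illustration:

```python
import re

text = 'This is a string with term1, but it does not have the other term.'

at_start = re.match('term1', text)    # only tries the beginning of the string
anywhere = re.search('term1', text)   # scans the whole string

print(at_start)           # None, because the text does not begin with 'term1'
print(anywhere.start())   # 22

# match() succeeds when the pattern sits at the very start
print(re.match('This', text).group())   # 'This'
```
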

Now moving to a different topic, let's see how we can split with the 're' syntax. This should look similar to how you used the split() method with strings:

In [7]:
split_term = '@'
phrase = 'What is the domain name of someone with the email: JohnDoe@gmail.com'
re.split(split_term,phrase)

['What is the domain name of someone with the email: JohnDoe', 'gmail.com']
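
Where re.split() goes beyond the string split() method is that the separator can be a pattern rather than one fixed string. The sketch below splits on either '@' or '.' by using a character set (bracket notation, covered later in this section); the sample address is just an illustration:

```python
import re

address = 'JohnDoe@gmail.com'

# The string method accepts a single fixed separator
print(address.split('@'))          # ['JohnDoe', 'gmail.com']

# re.split accepts a pattern: here, either '@' or a literal '.'
parts = re.split('[@.]', address)
print(parts)                       # ['JohnDoe', 'gmail', 'com']
```
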

You can use the re.findall() method to find all the instances of a pattern in a string. For example:

In [4]:
re.findall('match-phrase','test phrase match-phrase1 is in middle, there is indeed a match-phrase2')

['match-phrase', 'match-phrase']
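
Note that findall() returns plain strings, so the positions of the matches are lost. If you also need the locations, one option is re.finditer(), which yields a 'Match' object for each occurrence. A minimal sketch on the same phrase (the variable names are illustrative):

```python
import re

phrase = 'test phrase match-phrase1 is in middle, there is indeed a match-phrase2'

found = re.findall('match-phrase', phrase)
print(len(found))     # 2

# finditer yields Match objects, so start() is available for each hit
positions = [m.start() for m in re.finditer('match-phrase', phrase)]
print(positions)      # [12, 58]
```
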

In general, regular expressions support a huge variety of patterns, not merely finding where a single string occurs. The idea of regex is that we can use 'meta-characters' to find specific types of patterns.

To facilitate the demonstrations that follow, since we will be testing multiple 're' syntax forms, let's create a user-defined function that, given a list of regular expressions and a phrase to parse, prints out the results:

In [5]:
def multi_re_find(patterns,phrase):
    '''
    Takes in a list of regex patterns
    Prints a list of all matches
    '''
    for pattern in patterns:
        print('Searching the phrase using the re check: %r' %pattern)
        print(re.findall(pattern, phrase))

There are, roughly speaking, five ways to express repetition in a pattern:
   1. A pattern followed by the meta-character asterisk (*) is repeated zero or more times.
   2. Replace the asterisk with the plus sign (+) and the pattern must appear at least once.
   3. Using the question mark (?) means the pattern appears zero or one time.
   4. For a specific number of occurrences, use {m} after the pattern, where m is the number of times the pattern should repeat.
   5. Use {m,n} where m is the minimum number of repetitions and n is the maximum. Leaving out n ({m,}) means the pattern appears at least m times, with no maximum.

In [6]:
test_phrase = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'
test_patterns = ['sd*',          # s followed by zero or more d's
                 'sd+',          # s followed by one or more d's
                 'sd?',          # s followed by zero or one d's
                 'sd{3}',        # s followed by three d's
                 'sd{2,3}',      # s followed by two to three d's
                ]
multi_re_find(test_patterns,test_phrase)

Searching the phrase using the re check: 'sd*'
['sd', 'sd', 's', 's', 'sddd', 'sddd', 'sddd', 'sd', 's', 's', 's', 's', 's', 's', 'sdddd']
Searching the phrase using the re check: 'sd+'
['sd', 'sd', 'sddd', 'sddd', 'sddd', 'sd', 'sdddd']
Searching the phrase using the re check: 'sd?'
['sd', 'sd', 's', 's', 'sd', 'sd', 'sd', 'sd', 's', 's', 's', 's', 's', 's', 'sd']
Searching the phrase using the re check: 'sd{3}'
['sddd', 'sddd', 'sddd', 'sddd']
Searching the phrase using the re check: 'sd{2,3}'
['sddd', 'sddd', 'sddd', 'sddd']
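
The open-ended form {m,} is described above but not exercised by the test patterns. As a quick sketch on the same test phrase, compare it with the bounded form:

```python
import re

test_phrase = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'

# s followed by two or more d's: the final run keeps all four d's
at_least_two = re.findall('sd{2,}', test_phrase)
print(at_least_two)    # ['sddd', 'sddd', 'sddd', 'sdddd']

# s followed by two to three d's: the final run is cut off at three d's
two_to_three = re.findall('sd{2,3}', test_phrase)
print(two_to_three)    # ['sddd', 'sddd', 'sddd', 'sddd']

# s followed by four or more d's
at_least_four = re.findall('sd{4,}', test_phrase)
print(at_least_four)   # ['sdddd']
```
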


Character sets are used when you wish to match any one of a group of characters at a point in the input. Brackets are used to construct character set inputs. For example: the input [ab] searches for occurrences of either a or b. Let's see some examples:

In [7]:
test_phrase2 = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'
test_patterns2 = [ '[sd]',    # either s or d
                   's[sd]+']   # s followed by one or more s or d
multi_re_find(test_patterns2,test_phrase2)

Searching the phrase using the re check: '[sd]'
['s', 'd', 's', 'd', 's', 's', 's', 'd', 'd', 'd', 's', 'd', 'd', 'd', 's', 'd', 'd', 'd', 'd', 's', 'd', 's', 'd', 's', 's', 's', 's', 's', 's', 'd', 'd', 'd', 'd']
Searching the phrase using the re check: 's[sd]+'
['sdsd', 'sssddd', 'sdddsddd', 'sds', 'sssss', 'sdddd']
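
Character sets are not limited to the letters s and d, of course. The following sketch uses two made-up sample strings to show a set matching vowels and a set that allows for two spellings of a word:

```python
import re

# Any single vowel
vowels = re.findall('[aeiou]', 'regular expression')
print(vowels)      # ['e', 'u', 'a', 'e', 'e', 'i', 'o']

# 'gr', then either a or e, then 'y'
spellings = re.findall('gr[ae]y', 'grey or gray')
print(spellings)   # ['grey', 'gray']
```
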


We can use the caret sign ^ to exclude characters by placing it at the start of the bracket notation. For example: [^...] will match any single character not in the brackets. Let's see an example:

In [8]:
test_phrase3 = 'This is a string! But it has punctuation. How can we remove it?'
re.findall('[^!.? ]+',test_phrase3)

['This',
 'is',
 'a',
 'string',
 'But',
 'it',
 'has',
 'punctuation',
 'How',
 'can',
 'we',
 'remove',
 'it']

In the above example, we used [^!.? ] to match any character that is not a "!", ".", "?", or space. Adding the plus sign requires at least one such character in a row, which effectively translates into finding the words.

As character sets grow larger, typing every character that should (or should not) match could become very tedious. A more compact format using character ranges lets you define a character set to include all of the contiguous characters between a start and stop point. The format used is [start-end].

A common use case is to search for a specific range of letters in the alphabet: for example, [a-f] matches any single letter between a and f. Let's walk through some examples:

In [9]:
test_phrase4 = 'This is an example sentence. Lets see if we can find some letters.'
test_patterns4=[ '[a-z]+',      # sequences of lower case letters
                '[A-Z]+',      # sequences of upper case letters
                '[a-zA-Z]+',   # sequences of lower or upper case letters
                '[A-Z][a-z]+'] # one upper case letter followed by lower case letters           
multi_re_find(test_patterns4,test_phrase4)

Searching the phrase using the re check: '[a-z]+'
['his', 'is', 'an', 'example', 'sentence', 'ets', 'see', 'if', 'we', 'can', 'find', 'some', 'letters']
Searching the phrase using the re check: '[A-Z]+'
['T', 'L']
Searching the phrase using the re check: '[a-zA-Z]+'
['This', 'is', 'an', 'example', 'sentence', 'Lets', 'see', 'if', 'we', 'can', 'find', 'some', 'letters']
Searching the phrase using the re check: '[A-Z][a-z]+'
['This', 'Lets']
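
Ranges work for digits as well as letters, and several ranges can be combined in one set. The sketch below uses a made-up sample string to pick out six-character lower-case hexadecimal values, combining a range with the {m} repetition seen earlier:

```python
import re

# Single letters between a and f
print(re.findall('[a-f]', 'example'))    # ['e', 'a', 'e']

# Exactly six characters, each a digit or a letter from a to f
color_text = 'Background #ff8800, text #1a2b3c, border none'
hex_values = re.findall('[0-9a-f]{6}', color_text)
print(hex_values)                        # ['ff8800', '1a2b3c']
```
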


You can use special escape characters to find specific types of patterns in your data, such as digits, non-digits, whitespace, etc. For example:
   1. \d for a digit
   2. \D for a non-digit
   3. \s for whitespace (tab, space, newline, etc.)
   4. \S for non-whitespace
   5. \w for alphanumeric (letters, digits, and the underscore)
   6. \W for non-alphanumeric

Escapes are indicated by prefixing the character with a backslash sign: "\". Unfortunately, a backslash must itself be escaped in normal Python strings, and that results in expressions that are difficult to read. Using raw strings, created by prefixing the literal value with 'r', for creating regular expressions eliminates this problem and maintains readability:

In [10]:
test_phrase5 = 'This is a string with some numbers 1233 and a symbol #hashtag'
test_patterns5=[ r'\d+', # sequence of digits
                r'\D+', # sequence of non-digits
                r'\s+', # sequence of whitespace
                r'\S+', # sequence of non-whitespace
                r'\w+', # alphanumeric characters
                r'\W+', # non-alphanumeric
                ]
multi_re_find(test_patterns5,test_phrase5)

Searching the phrase using the re check: '\\d+'
['1233']
Searching the phrase using the re check: '\\D+'
['This is a string with some numbers ', ' and a symbol #hashtag']
Searching the phrase using the re check: '\\s+'
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']
Searching the phrase using the re check: '\\S+'
['This', 'is', 'a', 'string', 'with', 'some', 'numbers', '1233', 'and', 'a', 'symbol', '#hashtag']
Searching the phrase using the re check: '\\w+'
['This', 'is', 'a', 'string', 'with', 'some', 'numbers', '1233', 'and', 'a', 'symbol', 'hashtag']
Searching the phrase using the re check: '\\W+'
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' #']
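
This also explains why the output above shows '\\d+' even though we typed r'\d+': the two spellings denote the same string. The following sketch compares a raw string with its doubled-backslash equivalent:

```python
import re

# A raw string and a normal string with a doubled backslash are equal
print(r'\d+' == '\\d+')          # True

# In a normal string '\n' is one newline character;
# in a raw string it stays as a backslash followed by 'n'
print(len('\n'), len(r'\n'))     # 1 2

test_phrase5 = 'This is a string with some numbers 1233 and a symbol #hashtag'
raw_result = re.findall(r'\d+', test_phrase5)
escaped_result = re.findall('\\d+', test_phrase5)
print(raw_result == escaped_result)   # True, both are ['1233']
```
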
