# Regular Expressions

Regular expressions are text-matching patterns described with a formal syntax. You'll often hear them referred to as 'regex' or 'regexp' in conversation. Regular expressions can express a variety of rules, from finding repetition to matching specific text, and much more. As you advance in Python you'll see that a lot of your parsing problems can be solved with regular expressions (they're also a common interview question!).


If you're familiar with Perl, you'll notice that the syntax for regular expressions is very similar in Python. We will be using the re module with Python for this lecture.


Let's get started!

## Searching for Patterns in Text

One of the most common uses for the re module is for finding patterns in text. Let's do a quick example of using the search method in the re module to find some text:

In [2]:
import re
text = " This is a string with python, but it does not have javascript."
patterns = ['python', 'php']

for pattern in patterns:
    print(f"Searching for {pattern} in: \n{text}")
    
    #Check for match
    if re.search(pattern, text):
        print()
        print("Match was found.\n")
    else:
        print()
        print("No Match was found.\n")

Searching for python in: 
 This is a string with python, but it does not have javascript.

Match was found.

Searching for php in: 
 This is a string with python, but it does not have javascript.

No Match was found.



Now we've seen that re.search() takes the pattern, scans the text, and returns a **Match** object. If the pattern is not found, **None** is returned. To give a clearer picture of this match object, check out the cell below:

In [7]:
pattern = 'JavaScript'
text = "is the world's most popular programming language.\
JavaScript is the programming language of the Web.\
JavaScript is easy to learn.\
This tutorial will teach you JavaScript from basic to advanced."

match = re.search(pattern, text)
type(match)

re.Match

This **Match** object returned by the search() method is more than just a Boolean or None. It contains information about the match, including the original input string, the regular expression that was used, and the location of the match. Let's see the methods we can use on the match object:

In [8]:
# Show match position
match.span()

(49, 59)

In [9]:
match.start()

49

In [10]:
match.end()

59
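
The Match object also exposes the other pieces of information mentioned above. Here is a minimal sketch (the sample sentence and variable names are made up for this example) that reads the matched text, the input string, and the pattern, and guards against the None that re.search() returns when nothing matches:

```python
import re

# Hypothetical sample text, used only for illustration.
sample = "Regex lives in the re module."
found = re.search("re module", sample)

# re.search() returns None when nothing matches, so check before using it.
if found is not None:
    matched_text = found.group()      # the text that matched
    source_string = found.string      # the original input string
    used_pattern = found.re.pattern   # the pattern, as a string
    position = found.span()           # (start, end) of the match
    print(matched_text, position)

# A pattern that is not in the text gives None instead of a Match object.
missing = re.search("perl", sample)
print(missing is None)
```

Checking for None first matters because calling a method such as span() on None raises an AttributeError.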

## Split with regular expressions

Let's see how we can split with the re syntax. This should look similar to how you used the split() method with strings.

In [11]:
splitter = "@"
phrase = "Domain name of google is www.google.com and email is hello@gmail.com. For youtube, its www.youtube.com and email is hello@youtube.com"

re.split(splitter, phrase)

['Domain name of google is www.google.com and email is hello',
 'gmail.com. For youtube, its www.youtube.com and email is hello',
 'youtube.com']

In [12]:
re.split(splitter, phrase, maxsplit=1)

['Domain name of google is www.google.com and email is hello',
 'gmail.com. For youtube, its www.youtube.com and email is hello@youtube.com']

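Note how re.split() returns a list with the term to split on removed, and the items in the list are the pieces of the original string. Passing a maximum number of splits, as in the second call, stops after that many splits and leaves the rest of the string intact. Create a couple more examples for yourself to make sure you understand!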
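
Where re.split() goes beyond the string split() method is that the splitter can be a pattern rather than a fixed string. The sketch below uses made-up sample data and a character set with +, both of which are covered later in this lecture:

```python
import re

# Hypothetical sample data: fields separated by commas, semicolons, or spaces.
record = "apples, pears;plums  cherries"

# [,; ] matches any one of the three delimiters;
# the + treats a run of delimiters as a single split point.
fields = re.split("[,; ]+", record)
print(fields)  # ['apples', 'pears', 'plums', 'cherries']
```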

## Finding all instances of a pattern

You can use re.findall() to find all the instances of a pattern in a string. For example:

In [15]:
pattern = 'JavaScript'
text = "JavaScript is the world's most popular programming language.\
JavaScript is the programming language of the Web.\
JavaScript is easy to learn.\
This tutorial will teach you JavaScript from basic to advanced."

In [16]:
re.findall(pattern, text)

['JavaScript', 'JavaScript', 'JavaScript', 'JavaScript']
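
re.findall() returns plain strings, so the positions of the matches are lost. If you need them, one option is re.finditer(), which yields a Match object for each occurrence. A small sketch, using a shortened sample sentence invented for this example:

```python
import re

# Shortened, made-up sample text.
sentence = "JavaScript is popular. JavaScript is easy to learn."

# re.finditer() yields one Match object per occurrence,
# so the span of every match is available.
positions = [m.span() for m in re.finditer("JavaScript", sentence)]
print(positions)
```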

## Pattern re Syntax

This will be the bulk of this lecture on using re with Python. Regular expressions support a huge variety of patterns beyond simply finding where a single string occurred.

We can use *metacharacters* along with re to find specific types of patterns. 

Since we will be testing multiple re syntax forms, let's create a function that will print out results given a list of various regular expressions and a phrase to parse:

In [17]:
def multi_re_find(patterns, phrase):
    '''
    Takes in a list of regex patterns
    Prints a list of all matches
    '''
    for pattern in patterns:
        print(f"Searching the phrase using the re check: {pattern}")
        print(re.findall(pattern, phrase))
        print()
        
        

### Repetition Syntax

There are five ways to express repetition in a pattern:

    1.) A pattern followed by the meta-character * is repeated zero or more times. 
    2.) Replace the * with + and the pattern must appear at least once. 
    3.) Using ? means the pattern appears zero or one time. 
    4.) For a specific number of occurrences, use {m} after the pattern, where m is replaced with the number of times the pattern should repeat. 
    5.) Use {m,n} where m is the minimum number of repetitions and n is the maximum. Leaving out n ({m,}) means the value appears at least m times, with no maximum.
    
Now we will see an example of each of these using our multi_re_find function:

In [20]:
test_phrase = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'

test_patterns = ['sd*', 'sd+', 'sd?', 'sd{3}', 'sd{2,3}']

multi_re_find(test_patterns, test_phrase)

Searching the phrase using the re check: sd*
['sd', 'sd', 's', 's', 'sddd', 'sddd', 'sddd', 'sd', 's', 's', 's', 's', 's', 's', 'sdddd']

Searching the phrase using the re check: sd+
['sd', 'sd', 'sddd', 'sddd', 'sddd', 'sd', 'sdddd']

Searching the phrase using the re check: sd?
['sd', 'sd', 's', 's', 'sd', 'sd', 'sd', 'sd', 's', 's', 's', 's', 's', 's', 'sd']

Searching the phrase using the re check: sd{3}
['sddd', 'sddd', 'sddd', 'sddd']

Searching the phrase using the re check: sd{2,3}
['sddd', 'sddd', 'sddd', 'sddd']
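
The open-ended form {m,} from the list above is not covered by those test patterns, so here is a short sketch of it on the same test phrase. Compare the last match with the sd{2,3} result, where the cap of three d's cut it short:

```python
import re

test_phrase = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'

# {2,} means "at least two d's, with no upper limit".
at_least_two = re.findall('sd{2,}', test_phrase)
print(at_least_two)  # ['sddd', 'sddd', 'sddd', 'sdddd']
```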



## Character Sets

Character sets are used when you wish to match any one of a group of characters at a point in the input. Brackets are used to construct character set inputs. For example: the input [ab] searches for occurrences of either a or b.
Let's see some examples:

In [23]:
test_phrase = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'
test_patterns = ['[sd]',#either s or d
                 's[sd]+']

multi_re_find(test_patterns, test_phrase)

Searching the phrase using the re check: [sd]
['s', 'd', 's', 'd', 's', 's', 's', 'd', 'd', 'd', 's', 'd', 'd', 'd', 's', 'd', 'd', 'd', 'd', 's', 'd', 's', 'd', 's', 's', 's', 's', 's', 's', 'd', 'd', 'd', 'd']

Searching the phrase using the re check: s[sd]+
['sdsd', 'sssddd', 'sdddsddd', 'sds', 'sssss', 'sdddd']



It makes sense that the first pattern, [sd], returns every individual s and d in the phrase. The second pattern, s[sd]+, matches an s followed by one or more s or d characters, so for this particular test phrase it returns every run of letters from its first s onward.
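
A character set is also handy when a word has more than one accepted spelling. A minimal sketch with a made-up sentence:

```python
import re

# Hypothetical sample sentence with two spellings of the same word.
spellings = "The cat is grey, the dog is gray, and the bird is green."

# [ae] accepts exactly one character at that position: either a or e.
matches = re.findall("gr[ae]y", spellings)
print(matches)  # ['grey', 'gray']
```

Note that 'green' is not returned, because the character after the set must be a y.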

## Exclusion

We can use ^ to exclude terms by incorporating it into the bracket syntax notation. For example: [^...] will match any single character not in the brackets. Let's see some examples:

In [24]:
test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

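Use [^!.?] to match any character that is not a !, ., or ?. Add the + to require at least one such character in a row. Because spaces are not excluded here, this returns the text between punctuation marks, which is each sentence with its punctuation removed. Adding a space inside the brackets ([^!.? ]) would also break on spaces and return the individual words instead.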

In [25]:
test_pattern = '[^!.?]+'
re.findall(test_pattern, test_phrase)

['This is a string', ' But it has punctuation', ' How can we remove it']
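
For comparison, here is a short sketch of the same search with a space added to the excluded set. The variable name words is just for this example:

```python
import re

test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

# Excluding the space as well makes every match stop at a word boundary.
words = re.findall('[^!.? ]+', test_phrase)
print(words)
```

This prints the thirteen words of the phrase, with no punctuation or leading spaces.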

## Character Ranges

As character sets grow larger, typing every character that should (or should not) match could become very tedious. A more compact format using character ranges lets you define a character set to include all of the contiguous characters between a start and stop point. The format used is [start-end].

A common use case is searching for a specific range of letters in the alphabet. For example, [a-f] matches any single lowercase letter from a through f.

Let's walk through some examples:

In [27]:
test_phrase = "This is an example sentence. Let's see if we can find some letters."
test_patterns = ['[a-z]+',  # sequence of lower case letters
                '[A-Z]+',  # sequence of upper case letters
                '[a-zA-Z]+', # sequence of lower or upper case letters
                '[A-Z][a-z]+'] # one upper case letter followed by lower case letters
                 
multi_re_find(test_patterns, test_phrase)

Searching the phrase using the re check: [a-z]+
['his', 'is', 'an', 'example', 'sentence', 'et', 's', 'see', 'if', 'we', 'can', 'find', 'some', 'letters']

Searching the phrase using the re check: [A-Z]+
['T', 'L']

Searching the phrase using the re check: [a-zA-Z]+
['This', 'is', 'an', 'example', 'sentence', 'Let', 's', 'see', 'if', 'we', 'can', 'find', 'some', 'letters']

Searching the phrase using the re check: [A-Z][a-z]+
['This', 'Let']



In [32]:
test_phrase = 'Python was conceived in the late 1980s[42] by Guido van Rossum at Centrum Wiskunde & Informatica (CWI) in the Netherlands as a successor to the ABC programming language, which was inspired by SETL,[43] capable of exception handling (from the start plus new capabilities in Python 3.11) and interfacing with the Amoeba operating system.[13] Its implementation began in December 1989.[44] Van Rossum shouldered sole responsibility for the project, as the lead developer, until 12 July 2018, when he announced his "permanent vacation" from his responsibilities as Python\'s "benevolent dictator for life", a title the Python community bestowed upon him to reflect his long-term commitment as the project\'s chief decision-maker.[45] In January 2019, active Python core developers elected a five-member Steering Council to lead the project.[46][47]'
test_pattern = '[0-9]{4}'
re.findall(test_pattern, test_phrase)

['1980', '1989', '2018', '2019']
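
Several ranges can sit inside the same pair of brackets. The sketch below, on made-up sample text, combines three ranges with {6} to pick out six-digit hexadecimal colour codes:

```python
import re

# Hypothetical sample text; the last code is deliberately invalid.
css = "color: #ff8800; background: #00AAcc; border: #zz1234;"

# '#' followed by exactly six characters drawn from 0-9, a-f, or A-F.
hex_colours = re.findall("#[0-9a-fA-F]{6}", css)
print(hex_colours)  # ['#ff8800', '#00AAcc']
```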

## Escape Codes

You can use special escape codes to find specific types of patterns in your data, such as digits, non-digits, whitespace, and more. For example:

<table border="1" class="docutils">
<colgroup>
<col width="14%" />
<col width="86%" />
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Code</th>
<th class="head">Meaning</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td><tt class="docutils literal"><span class="pre">\d</span></tt></td>
<td>a digit</td>
</tr>
<tr class="row-odd"><td><tt class="docutils literal"><span class="pre">\D</span></tt></td>
<td>a non-digit</td>
</tr>
<tr class="row-even"><td><tt class="docutils literal"><span class="pre">\s</span></tt></td>
<td>whitespace (tab, space, newline, etc.)</td>
</tr>
<tr class="row-odd"><td><tt class="docutils literal"><span class="pre">\S</span></tt></td>
<td>non-whitespace</td>
</tr>
<tr class="row-even"><td><tt class="docutils literal"><span class="pre">\w</span></tt></td>
<td>alphanumeric (letters, digits, and the underscore)</td>
</tr>
<tr class="row-odd"><td><tt class="docutils literal"><span class="pre">\W</span></tt></td>
<td>non-alphanumeric (anything not matched by \w)</td>
</tr>
</tbody>
</table>

Escapes are indicated by prefixing the character with a backslash (\). Unfortunately, a backslash must itself be escaped in normal Python strings, and that results in expressions that are difficult to read. Using raw strings, created by prefixing the literal value with r, for creating regular expressions eliminates this problem and maintains readability.

Personally, I think the r prefix is probably one of the things that keeps someone who is not familiar with regex in Python from being able to read regex code at first. The r does not escape anything itself. It tells Python to leave the backslashes in the string alone so that they reach the re module unchanged. Hopefully after seeing these examples this syntax will become clear.
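
To make the difference concrete, here is a small sketch comparing a normal string with a raw string. The sample text and variable names are made up for this example:

```python
import re

# In a normal string, '\b' is an escape sequence: one backspace character.
normal = '\b'
# In a raw string the backslash is kept, so this is two characters: \ and b.
raw = r'\b'
print(len(normal), len(raw))  # 1 2

# Doubling the backslash in a normal string gives the same text as the raw form.
print('\\d+' == r'\d+')  # True

# The raw form is the easier one to read when handed to re.
room_numbers = re.findall(r'\d+', 'Room 101, floor 3')
print(room_numbers)  # ['101', '3']
```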

In [34]:
test_phrase = 'This is a string with some numbers 123355 and a symbol #hashtag'
test_patterns = [r'\d+',   # sequence of digits
                r'\D+',   # sequence of non-digits
                r'\s+',   # sequence of whitespace
                r'[\S+]', # character set: one non-whitespace character (the + is literal inside brackets)
                r'\w+',   # sequence of alphanumeric characters
                r'\W+'    # sequence of non-alphanumeric characters
                ]
                 
multi_re_find(test_patterns, test_phrase)

Searching the phrase using the re check: \d+
['123355']

Searching the phrase using the re check: \D+
['This is a string with some numbers ', ' and a symbol #hashtag']

Searching the phrase using the re check: \s+
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']

Searching the phrase using the re check: [\S+]
['T', 'h', 'i', 's', 'i', 's', 'a', 's', 't', 'r', 'i', 'n', 'g', 'w', 'i', 't', 'h', 's', 'o', 'm', 'e', 'n', 'u', 'm', 'b', 'e', 'r', 's', '1', '2', '3', '3', '5', '5', 'a', 'n', 'd', 'a', 's', 'y', 'm', 'b', 'o', 'l', '#', 'h', 'a', 's', 'h', 't', 'a', 'g']

Searching the phrase using the re check: \w+
['This', 'is', 'a', 'string', 'with', 'some', 'numbers', '123355', 'and', 'a', 'symbol', 'hashtag']

Searching the phrase using the re check: \W+
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' #']
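
Escape codes become more useful once they are combined with the repetition syntax from earlier. The sketch below runs a few such patterns against a made-up log line (the line and the variable names are assumptions for illustration only):

```python
import re

# Hypothetical log line, used only for illustration.
log_line = "2024-01-15 ERROR disk usage at 93% on /dev/sda1"

# Four digits, a dash, two digits, a dash, two digits.
date = re.search(r'\d{4}-\d{2}-\d{2}', log_line)
if date is not None:
    print(date.group())  # 2024-01-15

# Every run of digits anywhere in the line.
numbers = re.findall(r'\d+', log_line)
print(numbers)  # ['2024', '01', '15', '93', '1']

# Split on any run of whitespace.
tokens = re.split(r'\s+', log_line)
print(tokens)
```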



## Conclusion

You should now have a solid understanding of how to use the regular expression module in Python. There are many more special characters, but it would be unreasonable to go through every single use case. Instead, take a look at the full [documentation](https://docs.python.org/3/library/re.html#regular-expression-syntax) if you ever need to look up a particular case.

You can also check out the nice summary tables at this [source](http://www.tutorialspoint.com/python/python_reg_expressions.htm).

Good job!
