# Regular Expressions

Regular expressions are text-matching patterns described with a formal syntax. You'll often hear regular expressions referred to as 'regex' or 'regexp' in conversation. Regular expressions can include a variety of rules, from finding repetition, to text-matching, and much more. As you advance in Python you'll see that a lot of your parsing problems can be solved with regular expressions (they're also a common interview question!).


We will be using the <code>re</code> module with Python for this lecture.


## Searching for Patterns in Text

One of the most common uses for the re module is for finding patterns in text. Let's do a quick example of using the search method in the re module to find some text:

In [0]:
import re

# search()

The search() function searches the string for a match, and returns a Match object if there is a match.

For example, search the string for the word "Spain":

In [2]:
txt = "A lot of people in the Spain are in the pain due to the Covid-19" 
#match_obj = re.search("^A.*19$", txt) 
match_obj = re.search("Spain", txt) 
print(match_obj)

<_sre.SRE_Match object; span=(23, 28), match='Spain'>


### Note: the search method returns a Match object.

If no matches are found, the value None is returned:

In [3]:
txt = "A lot of people in the Spain are in the pain due to the Covid-19" 
#match_obj = re.search("^A.*Covid$", txt) 
match_obj = re.search("USA", txt) 
print(match_obj)

None
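
Because a failed search returns None, calling a method such as .group() on the result directly would raise an AttributeError. A minimal sketch of guarding against that is shown here; the helper name `describe_match` is made up for illustration and is not part of the re module:

```python
import re

txt = "A lot of people in the Spain are in the pain due to the Covid-19"

def describe_match(pattern, text):
    """Return a short description of the first match, or a fallback message."""
    match_obj = re.search(pattern, text)
    if match_obj is None:
        # No match: calling match_obj.group() here would raise AttributeError
        return f"{pattern!r} not found"
    return f"{pattern!r} found at {match_obj.span()}"

print(describe_match("Spain", txt))  # 'Spain' found at (23, 28)
print(describe_match("USA", txt))    # 'USA' not found
```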


If there is more than one match, only the first occurrence of the match will be returned:

In [4]:
txt = "A lot of people in the Spain are in the pain due to the Covid-19" 
match_obj = re.search("in", txt) 
print(match_obj)

<_sre.SRE_Match object; span=(16, 18), match='in'>


In [5]:
# Text to parse
text = "A lot of people in the Spain are in the pain due to the Covid-19" 

# Pattern to search for
pattern = 'Spain'

match_obj = re.search(pattern,text)

type(match_obj)

_sre.SRE_Match

# Match Object


Now we've seen that <code>re.search()</code> will take the pattern, scan the text, and then return a **Match** object. 

The **Match** object returned by search() is more than just a Boolean or None: it contains information about the match, including the original input string, the regular expression that was used, and the location of the match. Let's see the methods we can use on the match object:


In [6]:
print(match_obj)
print(type(match_obj))

<_sre.SRE_Match object; span=(23, 28), match='Spain'>
<class '_sre.SRE_Match'>



The Match object has properties and methods used to retrieve information about the search, and the result:

    .start() returns the start position of the match
    
    .end() returns the end position of the match (one past its last character)
    
    .span() returns a tuple containing the start and end positions of the match

    .group() returns the part of the string where there was a match

    .string returns the string passed into the function


To give a clearer picture of this match object, check out the cell below:


In [7]:
print(f'\nShow start of match------------------------------------------------------------ { match_obj.start() }')

print(f'\nShow end of match-------------------------------------------------------------- { match_obj.end() }')

print(f'\nreturns a tuple containing the start-, and end positions of the match --------- { match_obj.span() }')

print(f'\nreturns the part of the string where there was a match ------------------------ { match_obj.group()}')

print(f'\nreturns the string passed into the function ----------------------------------- { match_obj.string}')


Show start of match------------------------------------------------------------ 23

Show end of match-------------------------------------------------------------- 28

returns a tuple containing the start-, and end positions of the match --------- (23, 28)

returns the part of the string where there was a match ------------------------ Spain

returns the string passed into the function ----------------------------------- A lot of people in the Spain are in the pain due to the Covid-19
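
As a small sketch of how these pieces fit together, the span can be used to slice the original string, which should give back the same text as .group(). The `.re` attribute is used here to show the regular expression that produced the match; the variable names are only illustrative:

```python
import re

text = "A lot of people in the Spain are in the pain due to the Covid-19"
match_obj = re.search("Spain", text)

start, end = match_obj.span()

# Slicing the original string with the span gives the same text as .group()
matched_text = match_obj.string[start:end]
print(matched_text)                       # Spain
print(matched_text == match_obj.group())  # True

# .re holds the compiled pattern that produced this match
print(match_obj.re.pattern)               # Spain
```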


# Search multiple words/patterns in the text

In [13]:
# Text to parse
text = "A lot of people in the Spain are in the pain due to the Covid-19" 

# List of patterns to search for
patterns = ['in', 'the', 'USA']

for pattern in patterns:
    print(f'Searching for {pattern} inside the text --------- {text}\n')
    
    #Check for match
    if re.search(pattern,text):
        print('Yeah...... Match was found. \n\n')
    else:
        print('Oops...... No Match was found.\n\n')

Searching for in inside the text --------- A lot of people in the Spain are in the pain due to the Covid-19

Yeah...... Match was found. 


Searching for the inside the text --------- A lot of people in the Spain are in the pain due to the Covid-19

Yeah...... Match was found. 


Searching for USA inside the text --------- A lot of people in the Spain are in the pain due to the Covid-19

Oops...... No Match was found.




# findall()

The findall() function returns a list containing all matches.


In [11]:
# Text to parse
text = "A lot of people in the Spain are in the pain due to the Covid-19" 

# Pattern to search for
pattern = 'in'

#match_obj = re.search(pattern,text)
match_list = re.findall(pattern,text)

print(match_list)
print(len(match_list))
print(type(match_list) ) 

['in', 'in', 'in', 'in']
4
<class 'list'>


The list contains the matches in the order they are found.
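
findall() returns only the matched text. If the positions are needed as well, one option (not covered in this lecture, so treat this as a side note) is `re.finditer()`, which yields one Match object per occurrence in the same order:

```python
import re

text = "A lot of people in the Spain are in the pain due to the Covid-19"

# finditer() yields one Match object per occurrence, left to right
starts = [m.start() for m in re.finditer("in", text)]
print(starts)  # [16, 26, 33, 42]

# The matched strings are the same ones findall() returns
same_as_findall = [m.group() for m in re.finditer("in", text)] == re.findall("in", text)
print(same_as_findall)  # True
```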


**Note:** findall() returns an empty list if no match is found:

In [12]:
# Text to parse
text = "A lot of people in the Spain are in the pain due to the Covid-19" 

# Pattern to search for
pattern = 'USA'

#match_obj = re.search(pattern,text)
match_list = re.findall(pattern,text)

print(match_list)
print(type(match_list) )

[]
<class 'list'>


## split()


Let's see how we can split with the re syntax. This should look similar to how you used the split() method with strings.

In [14]:
# Term to split on
split_term = '@'

phrase = 'What is the domain name of someone with the email: hello@example.com'

# Split the phrase using python string method
print( phrase.split(split_term) )

# Split the phrase using re module
print( re.split(split_term,phrase) )

['What is the domain name of someone with the email: hello', 'example.com']
['What is the domain name of someone with the email: hello', 'example.com']


Note how <code>re.split()</code> returns a list of the pieces of the string, with the term it split on removed. Create a couple more examples for yourself to make sure you understand!
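
Where re.split() goes beyond the string method is that the split term can be a pattern. The sketch below splits on either of two delimiters and then limits the number of splits with the `maxsplit` argument; the second sample string is made up for illustration:

```python
import re

phrase = 'What is the domain name of someone with the email: hello@example.com'

# Split on either '@' or a colon followed by a space
parts = re.split(r'@|: ', phrase)
print(parts)
# ['What is the domain name of someone with the email', 'hello', 'example.com']

# maxsplit limits how many splits are made; the rest of the string is left intact
limited = re.split(r'\s', 'one two three four', maxsplit=2)
print(limited)  # ['one', 'two', 'three four']
```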



# sub()

The sub() function replaces the matches with the text of your choice. For example, replace every occurrence of "in" with "*":



In [15]:
# Text to parse
text = "A lot of people in the Spain are in the pain due to the Covid-19" 

# Pattern to search for
pattern = 'in'

#match_obj = re.search(pattern,text)
#match_list = re.findall(pattern,text)
replace_with_sub = re.sub(pattern, "*" , text)

print(replace_with_sub)

A lot of people * the Spa* are * the pa* due to the Covid-19


You can control the number of replacements by specifying the count parameter:

In [16]:
# Text to parse
text = "A lot of people in the Spain are in the pain due to the Covid-19" 

# Pattern to search for
pattern = 'in'

#match_obj = re.search(pattern,text)
#match_list = re.findall(pattern,text)
# Pass count by keyword; passing it positionally is deprecated in recent Python versions
replace_with_sub = re.sub(pattern, "*", text, count=2)

print(replace_with_sub)

A lot of people * the Spa* are in the pain due to the Covid-19
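
The same call works with a pattern instead of a fixed string. As a sketch, the whitespace escape `\s` (explained later in this lecture) replaces every whitespace character, and `re.subn()` additionally reports how many replacements were made:

```python
import re

text = "A lot of people in the Spain are in the pain due to the Covid-19"

# Replace every whitespace character with '*'
starred = re.sub(r"\s", "*", text)
print(starred)
# A*lot*of*people*in*the*Spain*are*in*the*pain*due*to*the*Covid-19

# subn() returns a tuple: (new string, number of replacements)
new_text, n_replaced = re.subn(r"\s", "*", text)
print(n_replaced)  # 14
```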


## Metacharacters

Regular expressions support a huge variety of patterns beyond simply finding where a single string occurs, and these patterns can be expressed in several ways.

Metacharacters are characters with a special meaning.


Since we will be testing multiple re syntax forms, let's create a function that will print out results given a list of various regular expressions and a phrase to parse:

In [0]:
def multi_re_find(patterns,phrase):
    '''
    Takes in a list of regex patterns
    Prints a list of all matches
    '''
    for pattern in patterns:
        print(f'Searching the phrase using the re check: {pattern}')
        print(re.findall(pattern,phrase))
        print('\n')

### Repetition Syntax

There are five ways to express repetition in a pattern:


   1. A pattern followed by the meta-character <code>*</code> is repeated zero or more times. 
   2. Replace the <code>*</code> with <code>+</code> and the pattern must appear at least once. 
   3. Using <code>?</code> means the pattern appears zero or one time. 
   4. For a specific number of occurrences, use <code>{m}</code> after the pattern, where **m** is replaced with the number of times the pattern should repeat. 
   5. Use <code>{m,n}</code> where **m** is the minimum number of repetitions and **n** is the maximum. Leaving out **n** <code>{m,}</code> means the value appears at least **m** times, with no maximum.
    
Now we will see an example of each of these using our multi_re_find function:

In [19]:
test_phrase = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'

test_patterns = [ 'sd*',     # s followed by zero or more d's
                'sd+',          # s followed by one or more d's
                'sd?',          # s followed by zero or one d's
                'sd{3}',        # s followed by three d's
                'sd{2,3}',      # s followed by two to three d's
                ]

multi_re_find(test_patterns,test_phrase)

Searching the phrase using the re check: sd*
['sd', 'sd', 's', 's', 'sddd', 'sddd', 'sddd', 'sd', 's', 's', 's', 's', 's', 's', 'sdddd']


Searching the phrase using the re check: sd+
['sd', 'sd', 'sddd', 'sddd', 'sddd', 'sd', 'sdddd']


Searching the phrase using the re check: sd?
['sd', 'sd', 's', 's', 'sd', 'sd', 'sd', 'sd', 's', 's', 's', 's', 's', 's', 'sd']


Searching the phrase using the re check: sd{3}
['sddd', 'sddd', 'sddd', 'sddd']


Searching the phrase using the re check: sd{2,3}
['sddd', 'sddd', 'sddd', 'sddd']




In [18]:
# Text to parse
test_phrase = "A lot of people in the Spain are in the pain due to the Covid-19. Oh God!!! Give the relaxaition now." 

# List of patterns to search for
test_patterns = ['ai*',         # a followed by zero or more i's
                'ai+',          # a followed by one or more i's
                'ai?',          # a followed by zero or one i's
                'ai{1,3}',      # a followed by one to three i's
                ]

multi_re_find(test_patterns,test_phrase)


Searching the phrase using the re check: ai*
['ai', 'a', 'ai', 'a', 'ai']


Searching the phrase using the re check: ai+
['ai', 'ai', 'ai']


Searching the phrase using the re check: ai?
['ai', 'a', 'ai', 'a', 'ai']


Searching the phrase using the re check: ai{1,3}
['ai', 'ai', 'ai']
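
The open-ended form <code>{m,}</code> from the list above is not exercised in the cells so far. A short sketch on the first test phrase shows how it differs from <code>{m,n}</code>:

```python
import re

test_phrase = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'

# s followed by two or more d's: no upper limit, so the final 'sdddd' is matched whole
two_or_more_d = re.findall('sd{2,}', test_phrase)
print(two_or_more_d)  # ['sddd', 'sddd', 'sddd', 'sdddd']

# With an upper limit of three, the final match stops after the third d
two_to_three_d = re.findall('sd{2,3}', test_phrase)
print(two_to_three_d)  # ['sddd', 'sddd', 'sddd', 'sddd']

# Runs of two or more s's
runs_of_s = re.findall('s{2,}', test_phrase)
print(runs_of_s)  # ['sss', 'sssss']
```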




## Character Sets

Character sets are used when you wish to match any one of a group of characters at a point in the input. Brackets are used to construct character set inputs. For example: the input <code>[ab]</code> searches for occurrences of either **a** or **b**.
Let's see some examples:

In [22]:
test_phrase = "A lot of people in the Spain are in the pain due to the Covid-19. Oh God!!! Give the relaxaition now." 

test_patterns = ['[of]',    # either o or f
                'o[tfp]+']   # o followed by one or more of t, f, or p

multi_re_find(test_patterns,test_phrase)

Searching the phrase using the re check: [of]
['o', 'o', 'f', 'o', 'o', 'o', 'o', 'o', 'o']


Searching the phrase using the re check: o[tfp]+
['ot', 'of', 'op']




It makes sense that the first input <code>[of]</code> returns every instance of o or f. The second input <code>o[tfp]+</code> returns every substring that begins with an o and continues with t, f, or p characters until any other character is reached.

## Exclusion

We can use <code>^</code> to exclude terms by incorporating it into the bracket syntax notation. For example: <code>[^...]</code> will match any single character not in the brackets. Let's see some examples:

In [0]:
#test_phrase = 'This is a string! But it has punctuation. How can we remove it?'
test_phrase = "A lot of people in the Spain are in the pain due to the Covid-19. Oh God!!! Give the relaxaition now."


Use <code>[^!.? ]</code> to match any character that is not a !, ., ?, or space. Add a <code>+</code> to match one or more such characters in a row. This basically translates into finding the words.

In [25]:
re.findall('[^!.? ]+',test_phrase)

['A',
 'lot',
 'of',
 'people',
 'in',
 'the',
 'Spain',
 'are',
 'in',
 'the',
 'pain',
 'due',
 'to',
 'the',
 'Covid-19',
 'Oh',
 'God',
 'Give',
 'the',
 'relaxaition',
 'now']
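
Notice that 'Covid-19' stays in one piece above because the hyphen is not in the excluded set. A small sketch on a shortened phrase (made up for illustration) shows one way to exclude it too, and that <code>^</code> only negates when it is the first character inside the brackets:

```python
import re

phrase = "Oh God!!! Covid-19 now."

# A hyphen placed last in the set is a literal hyphen, so 'Covid-19' is split as well
words = re.findall(r'[^!.? -]+', phrase)
print(words)  # ['Oh', 'God', 'Covid', '19', 'now']

# When ^ is not the first character, it is just an ordinary member of the set
literal_caret = re.findall(r'[a^]', 'a^b')
print(literal_caret)  # ['a', '^']
```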

## Character Ranges

As character sets grow larger, typing every character that should (or should not) match could become very tedious. A more compact format using character ranges lets you define a character set to include all of the contiguous characters between a start and stop point. The format used is <code>[start-end]</code>.

Common use cases are to search for a specific range of letters in the alphabet. For instance, <code>[a-f]</code> would return matches with any occurrence of letters between a and f. 

Let's walk through some examples:

In [27]:

#test_phrase = 'This is an example sentence. Lets see if we can find some letters.'
test_phrase = "A lot of people in the Spain are in the pain due to the Covid-19. Oh God!!! Give the relaxaition now." 


test_patterns=['[a-z]+',      # sequences of lower case letters
               '[A-Z]+',      # sequences of upper case letters
               '[a-zA-Z]+',   # sequences of lower or upper case letters
               '[A-Z][a-z]+'] # one upper case letter followed by lower case letters
                
multi_re_find(test_patterns,test_phrase)

Searching the phrase using the re check: [a-z]+
['lot', 'of', 'people', 'in', 'the', 'pain', 'are', 'in', 'the', 'pain', 'due', 'to', 'the', 'ovid', 'h', 'od', 'ive', 'the', 'relaxaition', 'now']


Searching the phrase using the re check: [A-Z]+
['A', 'S', 'C', 'O', 'G', 'G']


Searching the phrase using the re check: [a-zA-Z]+
['A', 'lot', 'of', 'people', 'in', 'the', 'Spain', 'are', 'in', 'the', 'pain', 'due', 'to', 'the', 'Covid', 'Oh', 'God', 'Give', 'the', 'relaxaition', 'now']


Searching the phrase using the re check: [A-Z][a-z]+
['Spain', 'Covid', 'Oh', 'God', 'Give']
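
Ranges are not limited to the whole alphabet. As a sketch, the <code>[a-f]</code> range mentioned above and a digit range behave like this on a short sample string (the string is made up for illustration):

```python
import re

sample = "Covid-19 began in 2019"

# Runs of lower case letters between a and f only
a_to_f = re.findall('[a-f]+', sample)
print(a_to_f)  # ['d', 'be', 'a']

# Runs of digits
digits = re.findall('[0-9]+', sample)
print(digits)  # ['19', '2019']
```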




## Escape Codes

You can use special escape codes to find specific types of patterns in your data, such as digits, non-digits, whitespace, and more. For example:

<table border="1" class="docutils">
<colgroup>
<col width="14%" />
<col width="86%" />
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Code</th>
<th class="head">Meaning</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td><tt class="docutils literal"><span class="pre">\d</span></tt></td>
<td>a digit</td>
</tr>
<tr class="row-odd"><td><tt class="docutils literal"><span class="pre">\D</span></tt></td>
<td>a non-digit</td>
</tr>
<tr class="row-even"><td><tt class="docutils literal"><span class="pre">\s</span></tt></td>
<td>whitespace (tab, space, newline, etc.)</td>
</tr>
<tr class="row-odd"><td><tt class="docutils literal"><span class="pre">\S</span></tt></td>
<td>non-whitespace</td>
</tr>
<tr class="row-even"><td><tt class="docutils literal"><span class="pre">\w</span></tt></td>
<td>alphanumeric (letters, digits, and the underscore)</td>
</tr>
<tr class="row-odd"><td><tt class="docutils literal"><span class="pre">\W</span></tt></td>
<td>non-alphanumeric (anything other than letters, digits, and the underscore)</td>
</tr>
</tbody>
</table>

Escapes are indicated by prefixing the character with a backslash <code> "\\"</code>. Unfortunately, a backslash must itself be escaped in normal Python strings, and that results in expressions that are difficult to read. Using raw strings, created by prefixing the literal value with <code>r</code>, eliminates this problem and maintains readability.

Personally, I think this use of <code>r</code> to keep backslashes literal is probably one of the things that blocks someone who is not familiar with regex in Python from being able to read regex code at first. Hopefully after seeing these examples this syntax will become clear.
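
To make the difference concrete, here is a small sketch comparing normal and raw strings. It uses the word-boundary sequence <code>\b</code>, which in a normal Python string means the backspace character instead; the sample text is made up for illustration:

```python
import re

# In a normal string the backslash has to be doubled to reach the regex engine;
# a raw string passes it through unchanged
same_pattern = ('\\d+' == r'\d+')
print(same_pattern)  # True

# '\b' in a normal string is a single backspace character, not backslash + b
lengths = (len('\b'), len(r'\b'))
print(lengths)  # (1, 2)

text = "pain in Spain"
without_raw = re.findall('\bin\b', text)   # searches for backspace, 'in', backspace
with_raw = re.findall(r'\bin\b', text)     # searches for the whole word 'in'
print(without_raw)  # []
print(with_raw)     # ['in']
```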

In [29]:
#test_phrase = 'This is a string with some numbers 1233 and a symbol #hashtag'
test_phrase = "A lot of people in the Spain are in the pain due to the Covid-19. Oh God!!! Give the relaxaition now." 
#\n - newline character
#\t - tab character

test_patterns=[ r'\d+', # sequence of digits
                r'\D+', # sequence of non-digits
                r'\s+', # sequence of whitespace
                r'\S+', # sequence of non-whitespace
                r'\w+', # alphanumeric characters
                r'\W+', # non-alphanumeric!
                ]

multi_re_find(test_patterns,test_phrase)

Searching the phrase using the re check: \d+
['19']


Searching the phrase using the re check: \D+
['A lot of people in the Spain are in the pain due to the Covid-', '. Oh God!!! Give the relaxaition now.']


Searching the phrase using the re check: \s+
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']


Searching the phrase using the re check: \S+
['A', 'lot', 'of', 'people', 'in', 'the', 'Spain', 'are', 'in', 'the', 'pain', 'due', 'to', 'the', 'Covid-19.', 'Oh', 'God!!!', 'Give', 'the', 'relaxaition', 'now.']


Searching the phrase using the re check: \w+
['A', 'lot', 'of', 'people', 'in', 'the', 'Spain', 'are', 'in', 'the', 'pain', 'due', 'to', 'the', 'Covid', '19', 'Oh', 'God', 'Give', 'the', 'relaxaition', 'now']


Searching the phrase using the re check: \W+
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '-', '. ', ' ', '!!! ', ' ', ' ', ' ', '.']




## Metacharacter Reference

Metacharacters are characters with a special meaning:

    Character  Description                                                  Example
    []         A set of characters                                          "[a-m]"
    \          Signals a special sequence (or escapes a special character)  r"\d"
    .          Any character (except the newline character)                 "he..o"
    ^          Starts with                                                  "^hello"
    $          Ends with                                                    "world$"
    *          Zero or more occurrences                                     "aix*"
    +          One or more occurrences                                      "aix+"
    {}         Exactly the specified number of occurrences                  "al{2}"
    |          Either or                                                    "falls|stays"
    ()         Capture and group
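
A few of these metacharacters have not appeared in the earlier cells. The sketch below tries the example patterns from the table on sample strings that are made up for illustration; the email pattern is deliberately simplistic and only meant to show capture groups:

```python
import re

text = "hello world"

# . matches any single character except a newline
dot_match = re.findall(r"he..o", text)
print(dot_match)  # ['hello']

# ^ anchors the pattern to the start of the string, $ to the end
starts_with_hello = bool(re.search(r"^hello", text))
ends_with_world = bool(re.search(r"world$", text))
starts_with_world = bool(re.search(r"^world", text))
print(starts_with_hello, ends_with_world, starts_with_world)  # True True False

# | matches either alternative
either = re.findall(r"falls|stays", "The rain in Spain falls or stays")
print(either)  # ['falls', 'stays']

# () captures part of the match; group(1) is the first captured group
m = re.search(r"(\w+)@(\w+\.\w+)", "hello@example.com")
print(m.group(1), m.group(2))  # hello example.com
```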

## Special Sequences

A special sequence is a \ followed by one of the characters in the list below, and has a special meaning. The examples are written as raw strings (the "r" prefix) so that Python passes the backslash to the regex engine unchanged:


    Sequence  Description                                                                      Example
    \A        Returns a match if the specified characters are at the beginning of the string   r"\AThe"
    \b        Returns a match where the specified characters are at the beginning
              or at the end of a word                                                          r"\bain", r"ain\b"
    \B        Returns a match where the specified characters are present, but NOT
              at the beginning (or at the end) of a word                                       r"\Bain", r"ain\B"
    \d        Returns a match where the string contains digits (numbers from 0-9)              r"\d"
    \D        Returns a match where the string DOES NOT contain digits                         r"\D"
    \s        Returns a match where the string contains a white space character                r"\s"
    \S        Returns a match where the string DOES NOT contain a white space character        r"\S"
    \w        Returns a match where the string contains any word characters (characters
              from a to Z, digits from 0-9, and the underscore _ character)                    r"\w"
    \W        Returns a match where the string DOES NOT contain any word characters            r"\W"
    \Z        Returns a match if the specified characters are at the end of the string
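
The boundary and anchor sequences are the ones not yet tried in this lecture. A quick sketch on a sample sentence (made up for illustration) shows how they behave:

```python
import re

text = "The rain in Spain"

# \b: "ain" at the start of a word (none here) vs. at the end of a word
at_word_start = re.findall(r"\bain", text)
at_word_end = re.findall(r"ain\b", text)
print(at_word_start)  # []
print(at_word_end)    # ['ain', 'ain']

# \B: "ain" NOT at the start of a word vs. NOT at the end of a word
not_at_start = re.findall(r"\Bain", text)
not_at_end = re.findall(r"ain\B", text)
print(not_at_start)   # ['ain', 'ain']
print(not_at_end)     # []

# \A and \Z anchor to the very beginning and very end of the string
begins_with_the = bool(re.search(r"\AThe", text))
ends_with_spain = bool(re.search(r"Spain\Z", text))
print(begins_with_the, ends_with_spain)  # True True
```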

## Sets

A set is a group of characters inside a pair of square brackets [] with a special meaning:


    Set         Description
    [arn]       Returns a match where one of the specified characters (a, r, or n) is present
    [a-n]       Returns a match for any lower case character, alphabetically between a and n
    [^arn]      Returns a match for any character EXCEPT a, r, and n
    [0123]      Returns a match where any of the specified digits (0, 1, 2, or 3) is present
    [0-9]       Returns a match for any digit between 0 and 9
    [0-5][0-9]  Returns a match for any two-digit number from 00 to 59
    [a-zA-Z]    Returns a match for any character alphabetically between a and z, lower case OR upper case
    [+]         In sets, +, *, ., |, (), $, and {} have no special meaning, so [+] means: return a match for any + character in the string
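
A few rows of this table can be checked quickly. The sample strings in this sketch are made up for illustration:

```python
import re

# One of the listed characters
any_of_arn = re.findall(r"[arn]", "Spain")
print(any_of_arn)  # ['a', 'n']

# Any character except the listed ones
none_of_arn = re.findall(r"[^arn]", "rain")
print(none_of_arn)  # ['i']

# Two sets side by side: a digit 0-5 followed by a digit 0-9
two_digit = re.findall(r"[0-5][0-9]", "08:59 and 61:75")
print(two_digit)  # ['08', '59']

# Inside a set, + loses its special meaning and matches a literal plus sign
plus_signs = re.findall(r"[+]", "1+1=2, 2+2=4")
print(plus_signs)  # ['+', '+']
```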

## Conclusion

You should now have a solid understanding of how to use the regular expression module in Python. There are many more special characters and sequences, but it would be unreasonable to go through every single use case. Instead, take a look at the full [documentation](https://docs.python.org/3/library/re.html#regular-expression-syntax) if you ever need to look up a particular pattern.

You can also check out the nice summary tables at [source1](http://www.tutorialspoint.com/python/python_reg_expressions.htm) and [source2](https://www.w3schools.com/python/python_regex.asp).

