In [1]:
import re

In [2]:
# Ordinary characters are the simplest regular expressions. 
# They match themselves exactly and do not have a special meaning in their regular expression syntax.
# Examples are 'A', 'a', 'X', '5'.

# Simple exact matches:

pattern = r"Cookie"
sequence = 'Cookie'
if re.match(pattern, sequence):
    print("Match!")
else:
    print("Not a match")

Match!


In [3]:
# The match() function returns a match object if the text matches the pattern. Otherwise, it returns None

pattern = r"Luis"
sequence = "luIs"
if re.match(pattern, sequence):
    print("Match!")
else:
    print("Not a match")

Not a match


The *r* at the start of the pattern *r"Cookie"* marks a **raw string literal**. 
It changes how the string literal is interpreted: the characters are stored exactly as they appear.
For example, \ is just a backslash in a raw string rather than the start of an escape sequence. 

*We didn't actually need it for this example; however, it is good practice to use it for consistency.*
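
A minimal sketch of the difference between a regular and a raw literal. The variable names are made up for this illustration:

```python
# Regular literal: "\n" is one newline character.
regular = "a\nb"
# Raw literal: the backslash and the "n" are stored as two separate characters.
raw = r"a\nb"

print(len(regular))  # 3
print(len(raw))      # 4
```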

## Wild Card Characters: Special Characters

Special characters are characters that do not match themselves as seen but have a special meaning when used in a regular expression.     
They can be thought of as *reserved metacharacters* that denote something else and not what they look like.

### **.**      
A period matches any single character, except the new line character.

In [4]:
re.search(r"Co.k.e", "Cookie").group()

'Cookie'

*The **search** function scans through the given string/sequence, looking for the first location where the regular expression produces a match.*

*The **group** function returns the string matched by the re.*

### **^**     
A caret matches the start of the string.

In [5]:
re.search("Eat", "Let's Eat Cake").group()

'Eat'

In [6]:
## However, the code below will not give the same result:

re.search(r'^eat', "Let's eat cake!").group()

AttributeError: 'NoneType' object has no attribute 'group'

We get an *AttributeError* because *re.search()* returned *None* *(it didn't find the pattern at the beginning of the string)*,     
so we are calling *group()* on *None*, which has no such method.   
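
One way to avoid this error is to check the result before calling a method on it. The snippet below is a small sketch of that habit, reusing the failing search from the cell above; the name *result* is only illustrative:

```python
import re

match = re.search(r'^eat', "Let's eat cake!")
# search() returned None here, so only call group() after checking the result.
result = match.group() if match else None
print(result)  # None
```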


### $
It matches the end of the string.     
Helpful if we want to make sure a document/sentence ends with certain characters.

In [7]:
re.search(r'cake$', "Cake! Let's eat cake" ).group()

'cake'

In [8]:
## The search will return None, and calling group() will raise an error, if we try this:

re.search(r'cake$', "Let's get some cake on our way home!").group()

AttributeError: 'NoneType' object has no attribute 'group'

###  [abc]
It matches **a** or **b** or **c**.

### [a-zA-Z0-9]
It matches any letter from **(a to z)** or **(A to Z)** or **(0 to 9)**.

*Characters that are not within a range can be matched by complementing the set:     
if the first character of the set is ^, all the characters that are not in the set will be matched.*



In [9]:
re.search(r'[0-6]', 'Number: 5').group()

'5'

In [10]:
re.search(r'Number: [^5]', 'Number: 0').group()

'Number: 0'

In [11]:
# This will not match, so search() returns None and group() raises an error
re.search(r'Number: [^5]', 'Number: 5').group()

AttributeError: 'NoneType' object has no attribute 'group'

### \ Backslash
 
The most versatile metacharacter:

- If the character following the backslash is a recognized escape character, then the special meaning of the sequence is taken (Scenario 1).
- Else, if the character following the \ is not a recognized escape character, a Python string literal passes the \ through like any other character. A *re* pattern is stricter: an unknown escape made of \ and an ASCII letter raises an error in current Python versions (Scenario 2). 
- \ can be used in front of all the metacharacters to remove their special meaning (Scenario 3). 


In [12]:
## (Scenario 1) '\s' is a recognized special sequence: it matches any whitespace character, here a space

re.search(r'Not a\sregular character', "Not a regular character").group()

'Not a regular character'

In [13]:
## (Scenario 2) Careful: '\r' IS a recognized escape (carriage return).
## This matches only because the \r in the pattern and the "\r" in the non-raw string both denote a carriage return
re.search(r'Just a \regular character', "Just a \regular character").group()

'Just a \regular character'

In [14]:
## (Scenario 3) The backslash is escaped with an extra `\`, so `\\s` matches the literal text '\s'
re.search(r'Just a \\sregular character', r'Just a \sregular character').group()

'Just a \\sregular character'
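
When a whole piece of text should be matched literally, escaping each metacharacter by hand is easy to get wrong. The standard library's *re.escape()* does it for an entire string. A short sketch, with sample strings invented for this illustration:

```python
import re

literal = "1+1=2?"
# re.escape() puts a backslash in front of the characters that are special in a pattern.
escaped = re.escape(literal)
found = re.search(escaped, "Is 1+1=2? Yes.")
print(found.group())  # 1+1=2?
```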

### There is a predefined set of special sequences that begin with ' \ ' and are also very helpful when performing search and match 

### \w - Lowercase 'w'.     
Matches any single letter, digit, or underscore. 

### \W - Uppercase 'W'.     
Matches any character not part of \w (lowercase w).


In [15]:
print("Lowercase w:", re.search(r'Co\wk\we', 'Cookie').group())

Lowercase w: Cookie


In [16]:
## Matches any character except single letter, digit or underscore

print("Uppercase W:", re.search(r'C\Wke', 'C@ke').group())

Uppercase W: C@ke


In [17]:
## Uppercase W won't match single letter, digit

print("Uppercase W won't match, and return:", re.search(r'Co\Wk\We', 'Cookie'))

Uppercase W won't match, and return: None


### \d - Lowercase d.     
Matches decimal digit 0-9. 

### \D - Uppercase d.     
Matches any character that is not a decimal digit.

In [18]:
# Example for \d
# The + symbol used after the \d is used for repetition

print("How many cookies do you want? ", re.search(r'\d+', '100 cookies').group())

How many cookies do you want?  100


### \t - Lowercase t.     
Matches tab.

### \n - Lowercase n.     
Matches newline. 

### \r - Lowercase r.     
Matches return. 

### \A - Uppercase a.     
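Matches only at the start of the string, even in MULTILINE mode. 

### \Z - Uppercase z.     
Matches only at the end of the string.     

TIP: *"^" and "\A" are effectively the same, and "$" and "\Z" are nearly the same ("$" also matches just before a trailing newline). The difference shows in MULTILINE mode, where "^" and "$" also match at the start and end of each line, while "\A" and "\Z" do not.*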
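
A small sketch of how the anchors differ in MULTILINE mode. The two-line sample text is made up for this illustration:

```python
import re

text = "first line\nsecond line"

# With re.MULTILINE, ^ also matches right after each newline.
caret_hits = re.findall(r'^\w+', text, re.MULTILINE)
# \A still matches only at the very start of the string.
anchor_hits = re.findall(r'\A\w+', text, re.MULTILINE)

print(caret_hits)   # ['first', 'second']
print(anchor_hits)  # ['first']
```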

### \b - Lowercase b.     
Matches only at the beginning or end of a word. 



In [19]:
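# Example for \t: the pattern expects a tab, but the string below contains a space,
# so search() returns None and group() raises an error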

print("\\t (TAB) example: ", re.search(r'Eat\tcake', 'Eat cake').group())

AttributeError: 'NoneType' object has no attribute 'group'

In [20]:
# Example for \b

print("\\b match gives: ",re.search(r'\b[A-E]ookie', 'Cookie').group())

\b match gives:  Cookie


## Repetitions

If you are looking to find long patterns in a sequence, the *re* module handles repetitions using the following special characters:

## +

Checks if the preceding character appears one or more times starting from that position

In [21]:
re.search(r'Co+kie', 'Coooookie').group()

'Coooookie'

## *

Checks if the preceding character appears zero or more times starting from that position

In [22]:
# Matches C, then zero or more a's, then zero or more o's, then kie

re.search(r'Ca*o*kie', 'Cookie').group()

'Cookie'

## ?

Checks if the preceding character appears exactly zero or one time starting from that position

In [23]:
# Checks for zero or one occurrence of u, so both 'Color' and 'Colour' match

re.search(r'Colou?r', 'Color').group()

'Color'

If you want to check for an exact number of repetitions
*(for example, checking the validity of a phone number in an application)*, the *re* module handles this using the following syntax:

- **{x}** - Repeat exactly x times. 
- **{x,}** - Repeat at least x times. 
- **{x,y}** - Repeat at least x times but no more than y times (written without a space after the comma).

In [24]:
re.search(r'\d{9,10}', '0987654321').group()

'0987654321'
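
Note that *search()* is satisfied as soon as 9 or 10 digits appear anywhere, even inside a longer run of digits. For validation, one option is *re.fullmatch()*, which requires the whole string to match. The sketch below assumes a hypothetical helper named *is_valid_number*; it is not part of the *re* module:

```python
import re

def is_valid_number(text):
    # fullmatch() succeeds only if the entire string is 9 or 10 digits.
    return re.fullmatch(r'\d{9,10}', text) is not None

print(is_valid_number('0987654321'))    # True
print(is_valid_number('098765432100'))  # False: 12 digits
# search() still finds 10 digits inside the longer string:
print(re.search(r'\d{9,10}', '098765432100').group())  # 0987654321
```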

## Grouping in Regular Expressions

The ***group*** feature of regular expressions allows us to pick out parts of the matching text.     
Parts of a regular expression pattern bounded by parentheses () are called *groups*.     
The parentheses do not change what the expression matches, but rather form groups within the matched sequence. 

We have been using the *group()* function all along in these examples. The plain *match.group()* without any argument still returns the whole matched text as usual.

In [25]:
# Imagine you were validating email addresses and wanted to check the user name and host. 
# This is when you would want to create separate groups within your matched text.

statement = 'Please contact us at: support@databoot.com'
match = re.search(r'([\w\.-]+)@([\w\.-]+)', statement)
if match:
  print("Email address:", match.group()) # The whole matched text
  print("Username:", match.group(1)) # The username (group 1)
  print("Host:", match.group(2)) # The host (group 2)


Email address: support@databoot.com
Username: support
Host: databoot.com


Another way of doing the same is to give each group a name with **?P<name>**. This lets you create **named groups**, which make your code more readable.     
The syntax for creating a named group is: **(?P<name>...)**. 

- Replace the name part with the name you want to give to your group. 
- The ... represents the rest of the matching syntax. 
    
Using the same example as before:

In [26]:
statement = 'Please contact us at: support@databoot.com'
match = re.search(r'(?P<email>(?P<username>[\w\.-]+)@(?P<host>[\w\.-]+))', statement)
if match:
  print("Email address:", match.group('email'))
  print("Username:", match.group('username'))
  print("Host:", match.group('host'))

Email address: support@databoot.com
Username: support
Host: databoot.com


> **TIP:** We can always access the named groups using numbers instead of the name. 
But as the number of groups increases, it gets harder to handle them using numbers alone.     
So, always make it a habit to use named groups instead.
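
Named groups can also be collected in one step with the match object's *groupdict()* method. A brief sketch, reusing the email example above with only the two inner groups:

```python
import re

statement = 'Please contact us at: support@databoot.com'
match = re.search(r'(?P<username>[\w.-]+)@(?P<host>[\w.-]+)', statement)
if match:
    # groupdict() maps each group name to the text it matched.
    parts = match.groupdict()
    print(parts)  # {'username': 'support', 'host': 'databoot.com'}
```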

## Greedy vs. Non Greedy Matching

When a special character **matches as much of the search sequence (string) as possible**, it is said to be a **"Greedy Match"**.     
It is the normal behavior of a regular expression, but sometimes this behavior is not desired:

In [27]:
heading = r'<h1>TITLE</h1>'

re.match(r'<.*>' ,heading).group()

'<h1>TITLE</h1>'

The pattern **<.*>** matched the whole string, right up to the second occurrence of >.

However, if you only wanted to match the first < h1 > tag, you could have used the **non-greedy quantifier *?**, which matches as little text as possible.
    
Adding **?** after a quantifier makes it perform the match in a **non-greedy or minimal fashion**. That is, as few characters as possible will be matched.     
When you run <.*?>, you will only get a match with < h1 >.

In [28]:
heading = r'<h1>TITLE</h1>'

re.match(r'<.*?>', heading).group()

'<h1>'

## Summary Table

|**Character(s)**	|**What it does**     |
|:-----|:-----|
|.	|A period. Matches any single character except the newline character.|
|^	|A caret. Matches a pattern at the start of the string.|
|\A	|Uppercase A. Matches only at the start of the string.|
|$	|Dollar sign. Matches the end of the string.|
|\Z	|Uppercase Z. Matches only at the end of the string.|
|[ ]|Matches the set of characters you specify within it.|
|\	|∙ If the character following the backslash is a recognized escape character, the special meaning of the sequence is taken.|
|\  |∙ Otherwise, the backslash (\\) is treated like any other character and passed through.|
|\  |∙ It can be used in front of all the metacharacters to remove their special meaning.|
|\w	|Lowercase w. Matches any single letter, digit, or underscore.|
|\W	|Uppercase W. Matches any character not part of \w (lowercase w).|
|\s	|Lowercase s. Matches a single whitespace character like: space, newline, tab, return.|
|\S	|Uppercase S. Matches any character not part of \s (lowercase s).|
|\d	|Lowercase d. Matches decimal digit 0-9.|
|\D	|Uppercase D. Matches any character that is not a decimal digit.|
|\t	|Lowercase t. Matches tab.|
|\n	|Lowercase n. Matches newline.|
|\r	|Lowercase r. Matches return.|
|\b	|Lowercase b. Matches only the beginning or end of the word.|
|+	|Checks if the preceding character appears one or more times.|
|*	|Checks if the preceding character appears zero or more times.|
|?	|∙ Checks if the preceding character appears exactly zero or one time.| 
|?  |∙ Specifies a non-greedy version of +, *|
|{ }|Checks for an explicit number of times.|
|( )|Creates a group when performing matches.|
|(?P<name>...)|Creates a named group when performing matches.|

> **TIP:** Although regular expressions are very powerful and helpful, be wary of long, confusing expressions that are hard for others, and for you, to understand and maintain over time.


## Functions provided by re

**compile(pattern, flags=0)**

Regular expressions are handled as strings by Python.     
However, with **compile()**, you can compile a regular expression pattern into a **regular expression object**.

When you need to use an expression several times in a single program, using *compile()* to save the resulting regular expression object for reuse is more efficient than passing the pattern string each time.     
Note, however, that the compiled versions of the most recent patterns passed to compile() and the module-level matching functions are cached, so programs that use only a few regular expressions at a time need not worry about compiling them.

In [29]:
pattern = re.compile(r"cookie")
sequence = "Cake and cookie"
pattern.search(sequence).group()

'cookie'

In [30]:
# This is equivalent to:
re.search(pattern, sequence).group()

'cookie'

**search(pattern, string, flags=0)**

With this function, you scan through the given string/sequence, looking for the first location where the regular expression produces a match.     
It returns a corresponding match object if found, else returns *None* if no position in the string matches the pattern.     
Note that *None* is different from finding a zero-length match at some point in the string.
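
The difference between *None* and a zero-length match can be seen in a short sketch. The patterns below were chosen only for this illustration:

```python
import re

# x* can match zero characters, so search() succeeds with an empty match.
empty = re.search(r'x*', 'cookie')
# x+ needs at least one "x", so there is no match at all.
missing = re.search(r'x+', 'cookie')

print(repr(empty.group()), empty.span())  # '' (0, 0)
print(missing)                            # None
```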


In [31]:
pattern = "cookie"
sequence = "Cake and cookie"

re.search(pattern, sequence)

<re.Match object; span=(9, 15), match='cookie'>

**match(pattern, string, flags=0)**

Returns a corresponding match object if zero or more characters at the beginning of string match the pattern.     
Else it returns *None*, if the string does not match the given pattern.

In [32]:
pattern = "C"
sequence1 = "IceCream"
sequence2 = "Cake"

# No match since "C" is not at the start of "IceCream"

print("Sequence 1: ", re.match(pattern, sequence1))

print("Sequence 2: ", re.match(pattern,sequence2).group())

Sequence 1:  None
Sequence 2:  C


**search() versus match()**

The **match()** function checks for a match **only at the beginning** of the string (by default), whereas the **search()** function checks for a match **anywhere in the string**.


**findall(pattern, string, flags=0)**

Finds all the possible matches in the entire sequence and returns them as a list of strings.     
Each returned string represents one match.

In [33]:
statement = "Please contact us at: support@databoot.com, xyz@databoot.com"

#'addresses' is a list that stores all the matches

addresses = re.findall(r'[\w\.-]+@[\w\.-]+', statement)
for address in addresses:
    print(address)

support@databoot.com
xyz@databoot.com
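
One detail worth knowing: if the pattern contains capturing groups, *findall()* returns the groups rather than the whole match, as a list of tuples when there is more than one group. A sketch using the same statement with two groups added:

```python
import re

statement = "Please contact us at: support@databoot.com, xyz@databoot.com"
# Two capturing groups: each result is a (username, host) tuple.
pairs = re.findall(r'([\w.-]+)@([\w.-]+)', statement)
print(pairs)
# [('support', 'databoot.com'), ('xyz', 'databoot.com')]
```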


**finditer(pattern, string, flags=0)**

Similar to *findall()* - it finds all the possible matches in the entire sequence but returns regex match objects as an iterator.

> **TIP:** finditer() might be an excellent choice when you want to have more information returned to you about your search. The returned regex match object holds not only the sequence that matched but also their positions in the original text.

In [34]:
statement = "Please contact us at: support@databoot.com, xyz@databoot.com"

#'addresses' is an iterator that yields a match object for each match

addresses = re.finditer(r'[\w\.-]+@[\w\.-]+', statement)
for address in addresses:
    print(address)

<re.Match object; span=(22, 42), match='support@databoot.com'>
<re.Match object; span=(44, 60), match='xyz@databoot.com'>


**sub(pattern, repl, string, count=0, flags=0)**

**subn(pattern, repl, string, count=0, flags=0)**

*sub()* is the substitute function. It returns the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string with the replacement *repl*. If the pattern is not found, then the string is returned unchanged.
*subn()* is similar to *sub()*. However, it returns a tuple containing the new string value and the number of replacements that were performed.


In [35]:
statement = "Please contact us at: xyz@databoot.com"
new_email_address = re.sub(r'([\w\.-]+)@([\w\.-]+)', r'support@databoot.com', statement)
print(new_email_address)

Please contact us at: support@databoot.com
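
The example above shows only *sub()*. The sketch below shows *subn()* together with a backreference (*\1*) in the replacement, so the captured username is kept. The statement and the *example.org* domain are made up for this illustration:

```python
import re

statement = "Contact: xyz@databoot.com, abc@databoot.com"
# \1 in the replacement refers to the first captured group (the username).
new_statement, count = re.subn(r'([\w.-]+)@[\w.-]+', r'\1@example.org', statement)
print(new_statement)  # Contact: xyz@example.org, abc@example.org
print(count)          # 2
```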


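**split(pattern, string, maxsplit=0, flags=0)**

This splits the string wherever the pattern matches and returns a list. If the optional argument ***maxsplit*** is nonzero, then at most *maxsplit* splits are performed. A compiled pattern offers the same operation as the method *pattern.split(string, maxsplit=0)*, which the example below uses.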

In [36]:
statement = "Please contact us at: xyz@databoot.com, support@databoot.com"
pattern = re.compile(r'[:,]')

address = pattern.split(statement)
print(address)

['Please contact us at', ' xyz@databoot.com', ' support@databoot.com']
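
A short sketch of *maxsplit* with the module-level function. The separator pattern here is a variation chosen for illustration: it also consumes the spaces around the separator, so the pieces come back without leading blanks:

```python
import re

statement = "Please contact us at: xyz@databoot.com, support@databoot.com"
# maxsplit=1 stops after the first separator; \s* also drops the spaces around it.
parts = re.split(r'\s*[:,]\s*', statement, maxsplit=1)
print(parts)
# ['Please contact us at', 'xyz@databoot.com, support@databoot.com']
```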


**start()** - Returns the starting index of the match. 

**end()** - Returns the index where the match ends. 

**span()** - Returns a tuple containing the (start, end) positions of the match.

In [37]:
pattern = re.compile('COOKIE', re.IGNORECASE)
match = pattern.search("I am not a cookie monster")

print("Start index:", match.start())
print("End index:", match.end())
print("Tuple:", match.span())

Start index: 11
End index: 17
Tuple: (11, 17)


## Compilation Flags

An expression's behavior can be modified by specifying a **flag** value.     
You can add flags as an extra argument to many different functions that we have seen in this tutorial.     
Some of the more useful ones are:


**IGNORECASE (I)** - Allows case-insensitive matches.

**DOTALL (S)** - Allows . to match any character, including newline.

**MULTILINE (M)** - Allows the start-of-string (^) and end-of-string ($) anchors to also match at the start and end of each line.

**VERBOSE (X)** - Allows you to write whitespace and comments within a regular expression to make it more readable. 

In [38]:
statement = "Please contact us at: support@DataBoot.com, xyz@DATABOOT.com"

# Using the VERBOSE flag helps understand complex regular expressions

pattern = re.compile(r"""
[\w\.-]+ #First part
@ #Matches @ sign within email addresses
databoot\.com #Domain (the dot is escaped so it matches a literal '.')
""", re.X | re.I)

addresses = re.findall(pattern, statement)                       
for address in addresses:
    print("Address: ", address)

Address:  support@DataBoot.com
Address:  xyz@DATABOOT.com


> **TIP**: We can combine multiple flags with the *bitwise OR* operator **|**, as in *re.X | re.I* above.
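
The DOTALL and MULTILINE flags are easiest to understand on a string that contains a newline. A minimal sketch, with a two-line sample text made up for this illustration:

```python
import re

text = "one\ntwo"

# Without DOTALL, . does not match the newline between the two words.
no_dotall = re.search(r'one.two', text)
# With DOTALL, . matches the newline as well.
with_dotall = re.search(r'one.two', text, re.DOTALL)
# With MULTILINE, ^ and $ match at the start and end of each line.
lines = re.findall(r'^\w+$', text, re.MULTILINE)

print(no_dotall)                  # None
print(repr(with_dotall.group()))  # 'one\ntwo'
print(lines)                      # ['one', 'two']
```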


# CASE STUDY - Working with Regular Expressions

In [39]:
import re
import requests
the_idiot_url = 'https://www.gutenberg.org/files/2638/2638-0.txt'

def get_book(url):
    # Sends an HTTP request to get the text from Project Gutenberg
    raw = requests.get(url).text
    # Discards the metadata from the beginning of the book
    start = re.search(r"\*\*\* START OF THIS PROJECT GUTENBERG EBOOK .* \*\*\*",raw ).end()
    # Discards the text starting Part 2 of the book
    stop = re.search(r"II", raw).start()
    # Keeps the relevant text
    text = raw[start:stop]
    return text

def preprocess(sentence):
    return re.sub('[^A-Za-z0-9.]+' , ' ', sentence).lower()

book = get_book(the_idiot_url)
processed_book = preprocess(book)
#print(processed_book)

In [40]:
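# Exercise I: 
# Find the number of occurrences of the article "the" in the corpus. Hint: Use the len() function.
# Note: r"the" also counts "the" inside longer words such as "there"; r"\bthe\b" would count whole words only.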

len(re.findall(r"the", processed_book))

302

In [41]:
# Exercise II: 
# Try to convert every single stand-alone instance of 'i' to 'I' in the corpus. 
# Make sure not to change the 'i' occurring within a word:


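# \b marks a word boundary, so an 'i' next to punctuation or at either end of the text is also converted
processed_book = re.sub(r'\bi\b', "I", processed_book)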
#print(processed_book)
       

In [45]:
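# Exercise III: 
# Find the number of times anyone was quoted ("") in the corpus.
# This counts closing curly quotes (”). A result of 0 suggests the downloaded text was decoded with the
# wrong encoding (note the stray 'â' characters in the next output), so the quotes do not appear as ”.

len(re.findall(r'”', book))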


0

In [47]:
# Exercise IV: What are the words connected by '--' in the corpus? 

re.findall(r'\w+(?:--\w+)+', book)

['ironical--it',
 'malicious--smile',
 'fur--or',
 'astrachan--overcoat',
 'it--the',
 'Italy--was',
 'malady--a',
 'money--and',
 'little--to',
 'No--Mr',
 'is--where',
 'I--I',
 'I--â',
 'crime--we',
 'or--judge',
 'gaiters--still--if',
 'through--well',
 'say--through',
 'however--and',
 'Epanchin--oh',
 'too--at',
 'was--and',
 'Andreevitch--that',
 'everyone--that',
 'reduce--or',
 'raise--to',
 'listen--and',
 'history--but',
 'individual--one',
 'yes--I',
 'but--â',
 't--not',
 'me--then',
 'perhaps--â',
 'Yes--those',
 'me--is',
 'servility--if',
 'Rogojin--hereditary',
 'citizen--who',
 'least--goodness',
 'memory--but',
 'latter--since',
 'Rogojin--hung',
 'him--I',
 'anything--sheâ',
 'old--and',
 'you--scarecrow',
 'certainly--certainly',
 'father--I',
 'Barashkoff--I',
 'see--and',
 'everything--Lebedeff',
 'about--heâ',
 'now--I',
 'Lihachof--â',
 'Zaleshoff--looking',
 'old--fifty',
 'so--and',
 'this--do',
 'day--not',
 'that--â',
 'do--by',
 'know--my',
 'illness--I',


- \w+ any single letter, digit or underscore, one or more times.
- (?: ... ) a non-capturing group: it groups the enclosed pattern so that the trailing + applies to all of it, without creating a numbered group.
- --  the connector.
- \w+ any single letter, digit or underscore, one or more times.
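
The non-capturing group matters here because *findall()* returns the group contents when a pattern has a capturing group. A small sketch comparing the two forms on a sample string made up for this illustration:

```python
import re

sample = "well--yes--no, maybe--not"

# Non-capturing group: findall() returns the whole match.
whole = re.findall(r'\w+(?:--\w+)+', sample)
# Capturing group: findall() returns only the last repetition of the group.
last_only = re.findall(r'\w+(--\w+)+', sample)

print(whole)      # ['well--yes--no', 'maybe--not']
print(last_only)  # ['--no', '--not']
```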