In [1]:
'''
You can tackle many basic patterns in Python using only ordinary characters.
Ordinary characters are the simplest regular expressions.
They match themselves exactly and have no special meaning in regular expression syntax.

Ordinary characters can be used to perform simple exact matches.
'''

import re

pattern = r"Coding!"  # the r prefix makes this a raw string literal, in which backslashes are kept as-is
'''
For example, in a raw string \ is just a backslash rather than the start of an escape sequence.
This matters for patterns that contain special characters.
We don't actually need it here, but using it everywhere keeps things consistent :)
'''

sequence = "Coding!"
if re.match(pattern, sequence):  # returns a match object if the start of the text matches the pattern, otherwise None
    print("Match!")
else: 
    print("Not a match!")

Match!
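
To make the remark about raw strings concrete, here is a minimal sketch comparing a normal and a raw string literal. The variable names are chosen only for this illustration.

```python
import re

# In a normal string literal, "\n" is a single newline character;
# in a raw string literal, r"\n" is two characters: a backslash and an n.
normal = "\n"
raw = r"\n"
print(len(normal), len(raw))  # 1 2

# Without the r prefix, the backslash must be doubled to reach the regex engine.
same_pattern = ("\\d" == r"\d")
print(same_pattern)  # True

# The raw form keeps the pattern readable.
digit = re.match(r"\d", "7 cookies").group()
print(digit)  # 7
```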


In [2]:
'''
Special characters do not match themselves literally;
instead, they have a special meaning when used in a regular expression.
The most widely used special characters are:
. - A period. Matches any single character except newline character.
\w - Lowercase w. Matches any single letter, digit or underscore.
\W - Uppercase w. Matches any character not part of \w (lowercase w).
\s - Lowercase s. Matches a single whitespace character like: space, newline, tab, return.
\S - Uppercase s. Matches any character not part of \s (lowercase s).
\t - Lowercase t. Matches tab.
\n - Lowercase n. Matches newline.
\r - Lowercase r. Matches return.
\d - Lowercase d. Matches decimal digit 0-9.
^ - Caret. Matches a pattern at the start of the string.
$ - Matches a pattern at the end of string.
[abc] - Matches a or b or c.
[a-zA-Z0-9] - Matches any single letter (a to z or A to Z) or digit (0 to 9).
Characters that are not in a set can be matched by complementing the set:
if the first character of the set is ^, every character that is not in the set is matched.
\A - Uppercase a. Matches only at the start of the string, even in MULTILINE mode (where ^ also matches after each newline).
\b - Lowercase b. Matches only at the beginning or end of a word.
\ - Backslash. If the character following the backslash forms a recognized special sequence,
then the special meaning of that sequence is taken.
For example, \n is considered as newline.
If the character following the \ is itself a special character,
then the \ removes its special meaning, so \. matches a period and \\ matches a backslash.

Python offers two different primitive operations based on regular expressions: 
match checks for a match only at the beginning of the string, 
while search checks for a match anywhere in the string (this is what Perl does by default).
Note that match may differ from search even when using a regular expression beginning with '^': 
'^' matches only at the start of the string, 
or in MULTILINE mode also immediately following a newline. 
The “match” operation succeeds only if the pattern matches at the start of the string regardless of mode, 
or at the starting position given by the optional pos argument regardless of whether a newline precedes it.
'''

import re

a = re.search(r'Co.k.e', 'Cookie').group()
print(a)

b = re.search(r'Co\wk\we', 'Cookie').group()
print(b)

c = re.search(r'C\Wke', 'C@ke').group()
print(c)

d = re.search(r'Eat\scake', 'Eat cake').group()
print(d)

e = re.search(r'Cook\Se', 'Cookie').group()
print(e)

'''
This example is left disabled; the subject string needs a real tab character for \t to match:
f = re.search(r'Eat\tcake', 'Eat\tcake').group()
print(f)
'''

g = re.search(r'c\d\dkie', 'c00kie').group()
print(g)

h = re.search(r'^Eat', 'Eat cake').group()
print(h)

i = re.search(r'cake$', 'Eat cake').group()
print(i)

j = re.search(r'Number: [0-6]', 'Number: 5').group()
print(j)

# matches any character except 5
k = re.search(r'Number: [^5]', 'Number: 0').group()
print(k)

l = re.search(r'\A[A-E]ookie', 'Cookie').group()
print(l)

m = re.search(r'\b[A-E]ookie', 'Cookie').group()
print(m)

# The doubled backslash '\\' in the pattern matches a literal backslash, so '\s' is not read as the whitespace class here
n = re.search(r'Back\\stail', r'Back\stail').group()
print(n)

# Here '\s' is the whitespace class (there is no extra '\' in front of it), so it matches the space in 'Back tail'
o = re.search(r'Back\stail', 'Back tail').group()
print(o)

Cookie
Cookie
C@ke
Eat cake
Cookie
c00kie
Eat
cake
Number: 5
Number: 0
Cookie
Cookie
Back\stail
Back tail
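
The difference between match and search described above, including the effect of MULTILINE mode on '^' and '\A', can be sketched as follows. The sample text is made up for this illustration.

```python
import re

text = "Eat cake\nBake cookies"

# match() only tries at position 0 of the string.
at_start = re.match(r"Bake", text)
# search() scans forward and finds "Bake" on the second line.
anywhere = re.search(r"Bake", text)

# In MULTILINE mode '^' also matches right after a newline,
# so search() succeeds with '^Bake' ...
multi_search = re.search(r"^Bake", text, re.MULTILINE)
# ... but match() stays anchored to the start of the whole string.
multi_match = re.match(r"^Bake", text, re.MULTILINE)
# '\A' matches only at the very start of the string, regardless of mode.
string_start = re.search(r"\ABake", text, re.MULTILINE)

print(at_start)              # None
print(anywhere.group())      # Bake
print(multi_search.group())  # Bake
print(multi_match)           # None
print(string_start)          # None
```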


In [3]:
'''
Writing out long, repetitive patterns character by character quickly becomes tedious.
Fortunately, the re module handles repetition with the following special characters:
+ - Matches one or more repetitions of the element to its left.
* - Matches zero or more repetitions of the element to its left.
? - Matches zero or one repetition of the element to its left.
'''

import re

a = re.search(r'Co+kie', 'Cooookie').group()
print(a)

# Matches zero or more 'a' followed by zero or more 'o' between 'C' and 'kie'
b = re.search(r'Ca*o*kie', 'Caokie').group() # The + and * qualifiers are said to be greedy
print(b)

# Matches zero or one 'u', so both 'Color' and 'Colour' are accepted
c = re.search(r'Colou?r', 'Color').group()
print(c)

'''
But what if you want a pattern to repeat an exact number of times?
For example, when checking the validity of a phone number in an application.
The re module handles this gracefully as well, using the following forms:
{x} - Repeat exactly x times.
{x,} - Repeat at least x times.
{x,y} - Repeat at least x times but no more than y times (write no space after the comma, or the braces are treated as literal text).
'''

d = re.search(r'\d{9,10}', '0987654321').group()
print(d)

Cooookie
Caokie
Color
0987654321
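
As a rough sketch of the phone number idea mentioned above, the counted repetition forms can be combined with fullmatch(). The rule "exactly 10 digits" and the helper name is_valid_phone are assumptions made for this illustration, not a real validation standard.

```python
import re

# Hypothetical rule for illustration: a phone number is exactly 10 digits.
phone = re.compile(r"\d{10}")

def is_valid_phone(candidate):
    # fullmatch() requires the whole string to match, not just a prefix.
    return phone.fullmatch(candidate) is not None

print(is_valid_phone("0987654321"))   # True
print(is_valid_phone("098765432"))    # False (only 9 digits)
print(is_valid_phone("09876543210"))  # False (11 digits)

# {x,y} is greedy: it takes up to y repetitions when they are available.
bounded = re.search(r"\d{2,4}", "123456").group()
print(bounded)    # 1234

# {x,} has no upper bound.
unbounded = re.search(r"\d{2,}", "123456").group()
print(unbounded)  # 123456
```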


In [4]:
'''
The group feature of regular expressions lets you pick out parts of the matching text.
Parts of a regular expression pattern enclosed in parentheses () are called groups.
The parentheses do not change what the expression matches; they form groups within the matched sequence.
You have been using the group() method all along in this tutorial's examples.
Called without an argument, match.group() still returns the whole matched text as usual.
'''

import re 

email_address = 'Please contact us at: support@datacamp.com'
match = re.search(r'([\w\.-]+)@([\w\.-]+)', email_address)
if match: # if this is true, meaning there is a match
    print(match.group()) # The whole matched text
    print(match.group(1)) # The username (group 1)
    print(match.group(2)) # The host (group 2)

support@datacamp.com
support
datacamp.com
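
Besides numbered access with group(1) and group(2), match objects also offer groups() and, for named groups, groupdict(). The following is a small sketch on the same sentence; the group names user and host are chosen only for this illustration.

```python
import re

email_address = 'Please contact us at: support@datacamp.com'

# (?P<name>...) gives a group a name in addition to its number.
match = re.search(r'(?P<user>[\w.-]+)@(?P<host>[\w.-]+)', email_address)
if match:
    print(match.groups())       # ('support', 'datacamp.com')
    print(match.group('user'))  # support
    print(match.group('host'))  # datacamp.com
    print(match.groupdict())    # {'user': 'support', 'host': 'datacamp.com'}
```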


In [5]:
'''
When a special character matches as much of the search sequence (string) as possible, 
it is said to be a "Greedy Match". 
It is the normal behavior of a regular expression but sometimes this behavior is not desired:
'''

import re

heading = r'<h1>TITLE</h1>'
a = re.match(r'<.*>', heading).group()
print(a)

'''
The pattern <.*> matched the whole string, right up to the last occurrence of >.
However, if you only wanted to match the first <h1> tag,
you could have used the non-greedy qualifier *? that matches as little text as possible.
Adding ? after the qualifier makes it perform the match in a non-greedy or minimal fashion;
that is, as few characters as possible will be matched.
When you run <.*?>, you will only get a match with <h1>.
'''

heading2 = r'<h1>TITLE</h1>'
b = re.match(r'<.*?>', heading2).group()
print(b)

<h1>TITLE</h1>
<h1>
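
The contrast between greedy and non-greedy matching is easy to see with findall() on the same heading. This is a minimal sketch; the negated character class in the last line is an alternative added here for comparison.

```python
import re

heading = '<h1>TITLE</h1>'

# Greedy: .* runs to the last '>' in the string, giving a single match.
greedy = re.findall(r'<.*>', heading)
print(greedy)   # ['<h1>TITLE</h1>']

# Non-greedy: .*? stops at the first '>' it can, giving one match per tag.
lazy = re.findall(r'<.*?>', heading)
print(lazy)     # ['<h1>', '</h1>']

# A negated character class gives the same result here without relying on laziness.
negated = re.findall(r'<[^>]*>', heading)
print(negated)  # ['<h1>', '</h1>']
```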


In [6]:
'''
The re library in Python provides several functions that make it a skill worth mastering.
You have already seen some of them, such as re.search() and re.match().
Let's check out some useful functions in detail:
    search(pattern, string, flags=0)
This function scans through the given string looking for the first location
where the regular expression produces a match.
It returns a corresponding match object if one is found,
and None if no position in the string matches the pattern.
Note that None is different from finding a zero-length match at some point in the string.
'''

import re

pattern = "cookie"
sequence = "Cake and cookie"

a = re.search(pattern, sequence).group()
print(a)

'''
    match(pattern, string, flags=0)
Returns a corresponding match object if zero or more characters at the beginning of string match the pattern. 
Else it returns None, if the string does not match the given pattern.
NOTE: The match() function checks for a match only at the beginning of the string (by default) 
whereas the search() function checks for a match anywhere in the string.
'''

pattern2 = "C"
sequence1 = "IceCream"

# No match since "C" is not at the start of "IceCream"
b = re.match(pattern2, sequence1)
print(b)

sequence2 = "Cake"

# Match, since "C" is at the start of "Cake"
c = re.match(pattern2, sequence2).group()
print(c)


cookie
None
C


In [7]:
'''
    findall(pattern, string, flags=0)
Finds all non-overlapping matches in the entire sequence and returns them as a list of strings.
Each returned string represents one match.
If the pattern contains groups, the list holds the group contents instead of the whole matches.
'''

import re

email_address = "Please contact us at: support@datacamp.com, xyz@datacamp.com"

# 'addresses' is a list that stores all the matches
addresses = re.findall(r'[\w\.-]+@[\w\.-]+', email_address)
for address in addresses: # printing the list of strings
    print(address)

support@datacamp.com
xyz@datacamp.com
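
What findall() returns depends on whether the pattern contains groups. The sketch below reuses the sentence from the cell above; the variable names are chosen only for this illustration.

```python
import re

text = "Please contact us at: support@datacamp.com, xyz@datacamp.com"

# No groups: each list item is the whole match.
whole = re.findall(r'[\w.-]+@[\w.-]+', text)
print(whole)  # ['support@datacamp.com', 'xyz@datacamp.com']

# One group: each list item is the text of that group.
users = re.findall(r'([\w.-]+)@[\w.-]+', text)
print(users)  # ['support', 'xyz']

# Several groups: each list item is a tuple with one entry per group.
pairs = re.findall(r'([\w.-]+)@([\w.-]+)', text)
print(pairs)  # [('support', 'datacamp.com'), ('xyz', 'datacamp.com')]
```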


In [8]:
'''
    sub(pattern, repl, string, count=0, flags=0)
This is the substitute function. It returns the string obtained by replacing or substituting 
the leftmost non-overlapping occurrences of pattern in string by the replacement repl. 
If the pattern is not found then the string is returned unchanged.
'''

import re

email_address = "Please contact us at: xyz@datacamp.com"
new_email_address = re.sub(r'([\w\.-]+)@([\w\.-]+)', r'support@datacamp.com', email_address)
print(new_email_address)

Please contact us at: support@datacamp.com
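
The replacement string can also refer back to groups, and the count argument limits how many occurrences are replaced. The following is a small sketch; the sample sentence and the example.org host are placeholders invented for this illustration.

```python
import re

text = "Contact xyz@datacamp.com or abc@datacamp.com"
pattern = r'([\w.-]+)@[\w.-]+'

# \1 in the replacement refers to group 1, so the user name is kept
# and only the host is replaced.
swapped = re.sub(pattern, r'\1@example.org', text)
print(swapped)     # Contact xyz@example.org or abc@example.org

# count=1 replaces only the leftmost occurrence.
first_only = re.sub(pattern, r'\1@example.org', text, count=1)
print(first_only)  # Contact xyz@example.org or abc@datacamp.com

# If the pattern is not found, the string comes back unchanged.
unchanged = re.sub(r'\d+', '#', text)
print(unchanged == text)  # True
```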


In [9]:
'''
    compile(pattern, flags=0)
Compiles a regular expression pattern into a regular expression object. 
When you need to use an expression several times in a single program,
using the compile() function and saving the resulting regular expression object for reuse is more efficient.
That said, the compiled versions of the most recent patterns passed to compile()
and the module-level matching functions are cached,
so programs that use only a few regular expressions at a time lose little by skipping compile().

Tip: an expression's behavior can be modified by specifying a flags value.
You can pass a flag as an extra argument to the various functions that you have seen in this tutorial.
Some of the available flags are: IGNORECASE, DOTALL, MULTILINE, VERBOSE, etc.
'''

import re

pattern = re.compile(r"cookie")
sequence = "Cake and cookie"
a = pattern.search(sequence).group()
print(a)

# This is equivalent to:
b = re.search(pattern, sequence).group()
print(b)

cookie
cookie
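
The flags mentioned in the tip can be sketched as follows. The sample strings are made up for this illustration, and flags are combined with the | operator.

```python
import re

# IGNORECASE: letters match regardless of case.
ignore_case = re.search(r'cookie', 'Cake and COOKIE', re.IGNORECASE).group()
print(ignore_case)  # COOKIE

# DOTALL: '.' also matches a newline, which it does not by default.
no_dotall = re.search(r'a.b', 'a\nb')
with_dotall = re.search(r'a.b', 'a\nb', re.DOTALL)
print(no_dotall)                # None
print(with_dotall is not None)  # True

# VERBOSE: whitespace in the pattern is ignored and '#' starts a comment,
# so a longer pattern can be laid out and documented line by line.
email = re.compile(r"""
    [\w.-]+   # user name
    @         # literal at sign
    [\w.-]+   # host name
""", re.VERBOSE | re.IGNORECASE)
found = email.search('Mail support@datacamp.com today').group()
print(found)  # support@datacamp.com
```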


In [21]:
# case study

import re
import requests
the_idiot_url = 'https://www.gutenberg.org/files/2638/2638-0.txt'

def get_book(url):
    # Sends an HTTP request to get the text from Project Gutenberg
    raw = requests.get(url).text
    
    # Discards the metadata from the beginning of the book
    start = re.search(r"\*\*\* START OF THIS PROJECT GUTENBERG EBOOK .* \*\*\*",raw ).end()
    
    # Cuts the text off at the first occurrence of "II" in the raw download, so only the text before it is kept
    stop = re.search(r"II", raw).start()
    
    # Keeps the relevant text
    text = raw[start:stop]
    return text

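def preprocess(sentence):
    # Replaces every run of characters other than letters, digits and '.' with one space, then lowercases
    return re.sub(r'[^A-Za-z0-9.]+', ' ', sentence).lower()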

book = get_book(the_idiot_url)
# Count the closing curly quotation marks (”) as a rough measure of how often anyone is quoted in the corpus.
quote = len(re.findall(r'\”', book))
print(quote)

processed_book = preprocess(book)
# print(processed_book)
print()

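# Count occurrences of the article "the" in the corpus.
# Note: r'the' also matches inside longer words such as "there"; r'\bthe\b' would count only the stand-alone word.
num = len(re.findall(r'the', processed_book))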
print(num)
print()

# Try to convert every single stand-alone instance of 'i' to 'I' in the corpus.
processed_book_i = re.sub(r'\si\s', " I ", processed_book)
#print(processed_book_i)

# What are the words connected by '--' in the corpus?
dashconnect = re.findall(r'[a-zA-Z0-9]*--[a-zA-Z0-9]*', book)
print(dashconnect)

0
()
302
()
[u'ironical--it', u'malicious--smile', u'fur--or', u'astrachan--overcoat', u'it--the', u'Italy--was', u'malady--a', u'money--and', u'little--to', u'No--Mr', u'is--where', u'I--I', u'I--', u'--though', u'crime--we', u'or--judge', u'gaiters--still', u'--if', u'through--well', u'say--through', u'however--and', u'Epanchin--oh', u'too--at', u'was--and', u'Andreevitch--that', u'everyone--that', u'reduce--or', u'raise--to', u'listen--and', u'history--but', u'individual--one', u'yes--I', u'but--', u't--not', u'me--then', u'perhaps--', u'Yes--those', u'me--is', u'servility--if', u'Rogojin--hereditary', u'citizen--who', u'least--goodness', u'memory--but', u'latter--since', u'Rogojin--hung', u'him--I', u'anything--she', u'old--and', u'you--scarecrow', u'certainly--certainly', u'father--I', u'Barashkoff--I', u'see--and', u'everything--Lebedeff', u'about--he', u'now--I', u'Lihachof--', u'Zaleshoff--looking', u'old--fifty', u'so--and', u'this--do', u'day--not', u'that--', u'do--by', u'kn