# Regular Expressions

Regular expressions are used to check whether a pattern occurs in a given sequence of characters (a string).

In [1]:
import re

## Basic Patterns: Ordinary Characters

Ordinary characters are the simplest regular expressions. They match themselves exactly and have no special meaning in regular expression syntax.

In [2]:
pattern = r"Cookie"
sequence = "Cookie"

if re.match(pattern, sequence):
    print("Match!")
else:
    print("Not a match!")

# The match() function returns a match object if the beginning of the
# text matches the pattern. Otherwise it returns None.

# The r at the start of the pattern r"Cookie" makes it a raw string
# literal. It changes how the string literal is interpreted: such
# literals are stored exactly as they appear.

# For example, \ is just a backslash in a raw string rather than the
# start of an escape sequence. Regular expression syntax often uses
# backslash-escaped characters, and the r prefix prevents Python from
# interpreting them as string escape sequences.


Match!
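
As a small sketch of the two points made in the comments above (the sample strings are made up for illustration), the following compares a normal and a raw string literal and shows that `re.match()` only looks at the start of the string:

```python
import re

# A normal literal turns "\n" into one newline character, while the
# raw literal keeps the backslash and the letter n as two characters.
normal = "\n"
raw = r"\n"
print(len(normal), len(raw))  # 1 2

# re.match() only tries to match at the start of the string.
at_start = re.match(r"Cookie", "Cookie jar")
not_at_start = re.match(r"Cookie", "A Cookie")
print(at_start.group())  # Cookie
print(not_at_start)      # None
```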


## Wildcard Characters: Special Characters

Special characters do not match themselves literally. Instead, they have a special meaning when used in a regular expression.

In [3]:
# . - A period: Matches any single character except the newline character.

re.search(r'Co.k.e', 'Cookie').group()

'Cookie'

In [4]:
# \w - Lowercase w: Matches any single letter, digit or underscore.
re.search(r'Co\wk\we', 'Cookie').group()

'Cookie'

In [5]:
# \W - Uppercase W: Matches any single character that is not matched by \w (lowercase w).
re.search(r'Co\Wkie', 'Co@kie').group()

'Co@kie'

In [6]:
# \s - Lowercase s: Matches any single whitespace character, such as a space, newline, tab, or carriage return.
re.search(r'Eat\scake', 'Eat cake').group()

'Eat cake'

In [7]:
# \S - Uppercase S: Matches any single character that is not matched by \s (lowercase s).
re.search(r'Cook\Se', 'Cookie').group()

'Cookie'

In [8]:
# \t - Lowercase t: Matches tab
# \n - Lowercase n: Matches newline.
# \r - Lowercase r: Matches carriage return.
# \d - Lowercase d: Matches decimal digit 0-9
re.search(r'c\d\dkie', 'c00kie').group()

'c00kie'
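
The shorthand classes above can be combined in a single pattern. A minimal sketch, assuming a made-up log line as input:

```python
import re

# Hypothetical log line, used only for illustration.
line = "user_42 logged in\tat 09:15"

# Four word characters, an underscore, then two digits.
name = re.search(r"\w\w\w\w_\d\d", line).group()
# Two digits, a colon, and two more digits.
time = re.search(r"\d\d:\d\d", line).group()
# \s matches the tab between "in" and "at"; \S does not.
gap = re.search(r"in\sat", line)
no_gap = re.search(r"in\Sat", line)

print(name)             # user_42
print(time)             # 09:15
print(gap is not None)  # True
print(no_gap)           # None
```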

In [9]:
# ^ - Caret: Matches the pattern at the start of the string.
re.search(r'^Eat', 'Eat cake').group()

'Eat'

In [10]:
# $ - Dollar: Matches the pattern at the end of the string.
re.search(r'cake$', 'Eat cake').group()

'cake'

In [11]:
# [abc] : Matches a or b or c
# [a-zA-Z0-9] : Matches any single letter (a to z or A to Z) or digit (0 to 9)
# If the first character of the set is ^, all the characters that are not
# in the set will be matched.
re.search(r'Number: [0-6]', 'Number: 5').group()

'Number: 5'

In [12]:
# [^5] : Matches any single character except 5
re.search(r'Number: [^5]', 'Number: 0').group()

'Number: 0'

In [13]:
# \A - Uppercase A: Matches only at the start of the string, even in
# multiline mode, where ^ would also match at the start of each line.
re.search(r'\A[A-E]ookie','Cookie').group()

'Cookie'
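
The difference between `^` and `\A` only shows up with the `re.MULTILINE` flag. The following sketch uses a made-up two-line string to illustrate it:

```python
import re

text = "Eat cake\nBake cookies"

# Without flags, ^ only matches at the very start of the string.
plain = re.search(r"^Bake", text)
# With re.MULTILINE, ^ also matches right after each newline ...
multi = re.search(r"^Bake", text, re.MULTILINE)
# ... but \A still matches only at the start of the whole string.
start_only = re.search(r"\ABake", text, re.MULTILINE)

print(plain)          # None
print(multi.group())  # Bake
print(start_only)     # None
```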

In [14]:
# \b - Lowercase b: Matches only at the beginning or end of a word.
re.search(r'\bCooki[e-z]', 'Cookie').group()

'Cookie'
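
A word boundary matters most when the pattern also occurs inside a longer word. A short sketch with an illustrative string:

```python
import re

text = "concatenate cat"

# Without \b, the first hit is the "cat" inside "concatenate".
inside = re.search(r"cat", text)
# With \b on both sides, only the standalone word matches.
whole = re.search(r"\bcat\b", text)

print(inside.start())  # 3
print(whole.start())   # 12
```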

In [15]:
# \ - Backslash: If the character following the backslash is a
# recognized escape character, then the special meaning of the
# term is taken. To match a literal backslash, escape it with a
# second backslash.

re.search(r'Back\\stail', r'Back\stail').group()

'Back\\stail'

In [16]:
re.search(r'Back\stail','Back tail').group()

'Back tail'
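
The same escaping applies to any special character. The sketch below uses the dot as an example, with made-up input strings. It also shows `re.escape()`, which builds an escaped pattern from plain text:

```python
import re

# Unescaped, the dot matches any character, so "3x14" is accepted.
loose = re.search(r"3.14", "3x14")
# Escaped, \. matches only a literal dot.
strict = re.search(r"3\.14", "3x14")
literal = re.search(r"3\.14", "pi is 3.14")

# re.escape() escapes the special characters in a plain string.
escaped = re.search(re.escape("3.14"), "3x14")

print(loose is not None)  # True
print(strict)             # None
print(literal.group())    # 3.14
print(escaped)            # None
```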

## Repetitions 

In [17]:
# + : Matches one or more repetitions of the element to its left.
re.search(r'Co+kie', 'Coookie').group()

'Coookie'

In [18]:
# * : Matches zero or more repetitions of the element to its left.
re.search(r'Ca*o*kie', 'Cokie').group()

'Cokie'

In [19]:
# ? : Matches zero or one occurrence of the element to its left.
re.search(r'Colou?r', 'Color').group()

'Color'

In [20]:
# {x} : Repeats the element to its left exactly x times.
# {x,} : Repeats it at least x times.
# {x,y} : Repeats it at least x times but no more than y times.

In [21]:
re.search(r'\d{9,10}', '0987654321').group()

'0987654321'
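
Each of the three brace forms can be tried on its own. A minimal sketch with made-up inputs:

```python
import re

# {3} and {4}: exactly three digits, a hyphen, exactly four digits.
phone = re.search(r"\d{3}-\d{4}", "call 555-1234").group()
# {2,3}: two or three digits; greedy, so it takes three.
upto3 = re.search(r"\d{2,3}", "12345").group()
# {2,}: two or more of the letter a.
atleast2 = re.search(r"a{2,}", "caaandy").group()
# Only one digit is present, so {2,3} cannot match.
too_few = re.search(r"\d{2,3}", "room 7")

print(phone)     # 555-1234
print(upto3)     # 123
print(atleast2)  # aaa
print(too_few)   # None
```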

## Groups and Grouping using Regular Expressions

Suppose that we are validating email addresses and want to check the user name and the host separately.

This is when the group feature of regular expressions comes in handy. It allows us to pick out parts of the matching text.

Parts of a regular expression pattern enclosed in parentheses () are called groups. The parentheses do not change what the expression matches, but rather form groups within the matched sequence. A plain match.group() without any argument still returns the whole matched text as usual.



In [22]:
email_address = 'Please contact us at: support@datacamp.com'
match = re.search(r'([\w\.-]+)@([\w\.-]+)', email_address)

if match:
    print(match.group())
    print(match.group(1))
    print(match.group(2))

support@datacamp.com
support
datacamp.com
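
Two related ways to read groups are sketched below: `groups()` returns all captured groups at once, and the `(?P<name>...)` syntax lets a group be read by name. The group names `user` and `host` are chosen here for illustration only:

```python
import re

email_address = 'Please contact us at: support@datacamp.com'

# groups() returns every captured group as a tuple.
match = re.search(r'([\w\.-]+)@([\w\.-]+)', email_address)
parts = match.groups()

# (?P<name>...) labels a group so it can be read by name.
named = re.search(r'(?P<user>[\w\.-]+)@(?P<host>[\w\.-]+)', email_address)
user = named.group('user')
host = named.group('host')

print(parts)  # ('support', 'datacamp.com')
print(user)   # support
print(host)   # datacamp.com
```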


## Greedy vs Non-Greedy Matching


In [23]:
# When a quantifier matches as much of the search sequence (string)
# as possible, it is called a greedy match. This is the default
# behavior of regular expressions, but it is not always desired.

heading = r'<h1>TITLE</h1>'
re.match(r'<.*>', heading).group()

# The pattern <.*> matched the whole string, right up to the last
# occurrence of >.


'<h1>TITLE</h1>'

In [24]:
# However, if we only wanted to match the first <h1> tag, we could
# use the non-greedy quantifier *?, which matches as little text as possible.
# Adding ? after a quantifier makes it match in a non-greedy,
# or minimal, fashion. That is, as few characters as possible are matched.

heading = r'<h1>TITLE</h1>'
re.match(r'<.*?>', heading).group()

'<h1>'
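
Two variations on the same idea, sketched with illustrative inputs: a negated character set gives the same result as `*?` for this heading, and `+?` is the non-greedy form of `+`:

```python
import re

heading = '<h1>TITLE</h1>'

# [^>]+ cannot run past the first >, so greediness is harmless here.
first_tag = re.match(r'<[^>]+>', heading).group()

# Greedy + takes every digit; non-greedy +? stops after the first one.
greedy = re.search(r'\d+', '12345').group()
lazy = re.search(r'\d+?', '12345').group()

print(first_tag)  # <h1>
print(greedy)     # 12345
print(lazy)       # 1
```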

## match vs search vs findall

search(pattern, string, flags=0): Scans through the given string looking for the first location where the regular expression produces a match. It returns a corresponding match object if one is found, or None if no position in the string matches the pattern.

match(pattern, string, flags=0): Returns a corresponding match object if zero or more characters at the beginning of the string match the pattern. Otherwise it returns None.

findall(pattern, string, flags=0): Finds all non-overlapping matches in the entire string and returns them as a list of strings, one string per match. If the pattern contains groups, the list holds the groups instead of the whole matches.



In [25]:
pattern = 'co+kie'
sequence = 'cake and cookie and cooookiesing'
re.search(pattern, sequence).group()

'cookie'

In [26]:
print(re.match("C", "IceCream"))
print(re.match("C", "Cake").group())
print(re.match("Ca", "Cool"))

None
C
None


In [27]:
email_address = "Please contact us at: abc@abc.com bbc@bbc.com ccb@ccb.com"

addresses = re.findall(r'[\w\.-]+@[\w\.-]+', email_address)

for address in addresses:
    print(address)

abc@abc.com
bbc@bbc.com
ccb@ccb.com
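
To compare `search()` and `findall()` on the same input, and to see what `findall()` returns when the pattern contains groups, consider this sketch, which reuses the strings from the cells above:

```python
import re

sequence = 'cake and cookie and cooookiesing'

# search() stops at the first match; findall() collects all of them.
first = re.search(r'co+kie', sequence).group()
every = re.findall(r'co+kie', sequence)

# With groups in the pattern, findall() returns tuples of the groups.
pairs = re.findall(r'([\w\.-]+)@([\w\.-]+)', 'abc@abc.com bbc@bbc.com')

print(first)  # cookie
print(every)  # ['cookie', 'cooookie']
print(pairs)  # [('abc', 'abc.com'), ('bbc', 'bbc.com')]
```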


### sub(pattern, repl, string, count=0, flags=0)

This is the substitute function. It returns the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string with the replacement repl. If the pattern is not found, the string is returned unchanged. The optional count argument limits the number of replacements; the default of 0 replaces all occurrences.



In [28]:
email_address = "Please contact us at: abc@abc.com bbc@bbc.com"
new_email_address = re.sub(r'([\w\.-]+)@([\w\.-]+)',r'final@final.com', email_address)
new_email_address

'Please contact us at: final@final.com final@final.com'
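
The replacement can also refer back to captured groups, and `count` can limit how many occurrences are replaced. A minimal sketch, in which `example.com` is a placeholder domain chosen for illustration:

```python
import re

email_address = "Please contact us at: abc@abc.com bbc@bbc.com"
pattern = r'([\w\.-]+)@([\w\.-]+)'

# \1 in the replacement inserts the text captured by group 1.
same_user = re.sub(pattern, r'\1@example.com', email_address)

# count=1 replaces only the first (leftmost) occurrence.
first_only = re.sub(pattern, 'final@final.com', email_address, count=1)

print(same_user)   # Please contact us at: abc@example.com bbc@example.com
print(first_only)  # Please contact us at: final@final.com bbc@bbc.com
```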

### compile(pattern, flags=0)

Compiles a regular expression pattern into a regular expression object. When an expression is used several times in a single program, saving the compiled object returned by compile() and reusing it is more efficient.

In [29]:
pattern = re.compile(r'cookie')
sequence = "Cake and cookie"
print(pattern.search(sequence).group())

# This is equivalent to:
print(re.search(pattern, sequence).group())

cookie
cookie
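
A compiled pattern object carries its flags with it and offers the same functions as methods. The sketch below assumes case-insensitive matching is wanted and uses made-up input strings:

```python
import re

# Compile once with a flag, then reuse the object's methods.
pattern = re.compile(r'cookie', re.IGNORECASE)

found = pattern.findall("Cookie and cookie and COOKIE")
replaced = pattern.sub("cake", "Cookie jar")
missing = pattern.search("Cake only")

print(found)     # ['Cookie', 'cookie', 'COOKIE']
print(replaced)  # cake jar
print(missing)   # None
```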
