# Intro to Regular Expressions

A regular expression is a compact language for describing patterns in text.

Regular expressions are implemented in every major programming language.

The syntax differs slightly between languages, but the core is largely the same.

In Python, regular expressions are available through the standard library's "re" module.

For our purposes, we will focus on the "search" function, which scans the whole string for the first place where the pattern matches. It returns a match object describing the location of that match, or None if the pattern does not match anywhere.
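
To make "the location of the match" concrete, here is a minimal sketch of inspecting the object that search returns. The sample string is only an illustration; start(), end(), span() and group() are standard methods of Python match objects.

```python
from re import search

match = search(r'cat', 'a cat went home')

# search returns a match object for the first match, or None if there is none
assert match is not None
assert match.start() == 2        # index of the first matched character
assert match.end() == 5          # index just past the last matched character
assert match.span() == (2, 5)    # (start, end) as a tuple
assert match.group() == 'cat'    # the text that was matched
```

Because a failed search returns None, calling these methods is only safe after checking that a match was found.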

In [2]:
from re import search

In [7]:
# Basic letters and numbers are valid regular expressions

assert(search(r'cat', 'a cat went home') != None)

In [8]:
assert(search(r'cat', 'a dog went home') == None)

In [9]:
# You can include "optional" letters by following the letter with 
# a question mark: 

assert(search(r'cats?', 'a cat went home') != None)
assert(search(r'cats?', 'cats went home') != None)

In [17]:
# You can include a search for "one or more" with the +
# For example, let's assume we want only to match an 
# exclamation of "cat" that ends with one or more
# exclamation points: 

assert(search(r'cats?!+', 'a cat went home') == None)
assert(search(r'cats?!+', 'a cat!') != None)
assert(search(r'cats?!+', 'a cat!!!') != None)
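
The assertions above only check whether a match exists. As a small sketch (the sample strings are illustrative), group() shows exactly which text the ? and + quantifiers consumed:

```python
from re import search

# ? makes only the preceding character (the "s") optional
singular = search(r'cats?', 'a cat went home')
plural = search(r'cats?', 'cats went home')
assert singular.group() == 'cat'
assert plural.group() == 'cats'

# + is greedy: it takes as many exclamation points as it can
excited = search(r'cats?!+', 'a cat!!! really')
assert excited.group() == 'cat!!!'

# The "t" is not optional, so "ca" alone does not match
truncated = search(r'cats?', 'a ca went home')
assert truncated is None
```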

In [21]:
# Note that our example isn't only matching the
# word "cat":

assert(search(r'cat', 'a category') != None)

In [24]:
# We can build up a pattern of characters and spaces.
# The \s expression matches any single whitespace character
# (space, tab, newline). In regular expressions, a backslash
# followed by a letter denotes a "special character" such as \s:

assert(search(r'\scat\s', 'a category') == None)
assert(search(r'\scat\s', 'a cat went home') != None)

In [28]:
# But now our expression doesn't match the following: 

assert(search(r'\scat\s', 'a cat') == None)
assert(search(r'\scat\s', 'a cat.') == None)
assert(search(r'\scat\s', 'cat') == None)

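# which is a problem: \s needs an actual whitespace character
# on each side, so "cat" at the start or end of the string,
# or next to punctuation, is missed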

In [65]:
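# There is another special character, \b, which matches a
# "word boundary": the position between a word character and
# a non-word character, or the start or end of the string.
# It is very powerful for this common scenario: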

assert(search(r'\bcat\b', 'a cat') != None)
assert(search(r'\bcat\b', 'a cat.') != None)
assert(search(r'\bcat\b', 'cat') != None)
assert(search(r'\bcat\b', 'a cat went home') != None)
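
It is also worth checking what \b rejects. The following is a small sketch with illustrative strings, showing that the earlier "category" problem goes away and that the boundary itself is not part of the matched text:

```python
from re import search

# \b is zero-width: it matches a position, not a character,
# so "cat" followed or preceded directly by letters is rejected
inside_longer_word = search(r'\bcat\b', 'a category')
buried_in_word = search(r'\bcat\b', 'concatenate')
assert inside_longer_word is None
assert buried_in_word is None

# Punctuation creates a boundary but is not included in the match
before_period = search(r'\bcat\b', 'a cat.')
assert before_period.group() == 'cat'
```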

In [73]:
# Another useful special character is the \w
# character. It matches any "word character", which
# means, basically, letters, digits and underscores.
# This can be used, for example, to find hashtags:

assert(search(r'#\w+', 'a #cat') != None)
assert(search(r'#\w+', 'a #@home') == None)
assert(search(r'#\w+', 'a #') == None)
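
Beyond testing for a hashtag, the matched text can be pulled out with group(). This is a minimal sketch; the sample sentences are made up for illustration.

```python
from re import search

tag = search(r'#\w+', 'I love my #cat_2 so much!')
assert tag.group() == '#cat_2'   # letters, digits and underscores all count

# \w+ stops at the first non-word character, so the match can be partial
partial = search(r'#\w+', 'a #c@t')
assert partial.group() == '#c'
```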

In [74]:
# We can also negate a set of characters using ^. For example, we
# might be interested in anything that's NOT a whitespace character.
# Note: ^ only negates when it is the first character inside
# square brackets []; outside brackets it has a different meaning

assert(search(r'#[^\s]+', 'a #cat') != None)
assert(search(r'#[^\s]+', 'a #c@t') != None)
assert(search(r'#[^\s]+', 'a #@home') != None)
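
Two related details, sketched below with the same sample strings: Python's re module also provides \S as a shorthand for "not whitespace", and a ^ placed outside square brackets anchors the pattern to the start of the string instead of negating anything.

```python
from re import search

# \S is equivalent to [^\s] for these inputs
results = []
for text in ['a #cat', 'a #c@t', 'a #@home']:
    negated_class = search(r'#[^\s]+', text)
    shorthand = search(r'#\S+', text)
    assert negated_class.group() == shorthand.group()
    results.append(shorthand.group())

# Outside brackets, ^ means "start of the string"
at_start = search(r'^cat', 'cat went home')
not_at_start = search(r'^cat', 'a cat went home')
assert at_start is not None
assert not_at_start is None
```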

In [13]:
# You can use a logical or with "|"

assert(search(r'cat|dog', 'a dog went home') != None)
assert(search(r'cat|dog', 'a cat went home') != None)
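
When alternation is combined with other pieces of a pattern, parentheses limit what the | applies to. The sketch below assumes we want whole words only and reuses \b and ? from the earlier cells; the variable name pattern and the sample strings are illustrative.

```python
from re import search

# Parentheses group the alternatives so \b and s? apply to both
pattern = r'\b(cat|dog)s?\b'
dogs = search(pattern, 'two dogs went home')
cat = search(pattern, 'a cat went home')
assert dogs.group() == 'dogs'
assert cat.group(1) == 'cat'     # group(1) is the text inside the parentheses
assert search(pattern, 'a category') is None

# Without parentheses, | splits the whole pattern into
# "\bcat" OR "dog\b", which matches the start of "category"
ungrouped = search(r'\bcat|dog\b', 'a category')
assert ungrouped is not None
```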