# Python Regular Expressions

I recently worked on a small bit of regex code to do string processing in JavaScript. This is a quick intro to regex and how to use it properly, with small code examples in Python.

In [1]:
import re

In [2]:
# a small shorthand for printing the matched text of a regex match object
def mprint(match):
    # group() with no arguments returns the whole matched substring
    print(match.group())

In [3]:
# A basic start. RegEx is made to match certain string patterns. 

# To match the character 'a':

string = 'This is a string with a lot of "a" characters inside'
regex = r'a'

# finditer returns an iterator over all non-overlapping matches,
# so here we find all of the 'a' characters
for match in re.finditer(regex, string):
    mprint(match)
    
# split takes the matches and splits on them
print(re.split(regex, 'Thisasentenceaisasplitabyathea"A"s'))

a
a
a
a
a
['This', 'sentence', 'is', 'split', 'by', 'the', '"A"s']
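
The two calls above can be compared with a couple of related ones. The following is a minimal sketch, not part of the original notebook: `re.findall` returns the matched text directly, while the match objects from `re.finditer` also carry positions, and `re.split` accepts a `maxsplit` argument. The variable names are only illustrative.

```python
import re

text = 'This is a string with a lot of "a" characters inside'
pattern = re.compile(r'a')

# findall returns the matched substrings as a plain list
found = pattern.findall(text)

# finditer yields match objects, which also know where each match sits
spans = [m.span() for m in pattern.finditer(text)]

# maxsplit limits how many splits are performed
limited = re.split(r'a', 'oneatwoathree', maxsplit=1)

print(found)
print(spans[0])
print(limited)
```

Use `finditer` when the positions matter and `findall` when only the matched text is needed.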


In [4]:
# Here is a second regex for the word monkey
# we also introduce the ^ and $ characters.
# ^ means you are starting at the beginning of the string
# $ means end of the string.
regex = r'^monkey$'

for match in re.finditer(regex,'monkey'):
    mprint(match)

monkey


In [5]:
# this does not print anything because
# we have characters in the string before monkey
# and after monkey.
for match in re.finditer(regex,'I like monkeys'):
    mprint(match)

In [6]:
# same for this
for match in re.finditer(regex,'monkey business'):
    mprint(match)

In [7]:
# if we use this regex, the above strings all have matches
regex = r'monkey'
for match in re.finditer(regex, 'monkey'):
    mprint(match)

monkey


In [8]:
for match in re.finditer(regex, 'I like monkeys and monkey business'):
    mprint(match)

monkey
monkey
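
The anchored and unanchored patterns above have close counterparts among the functions in the `re` module. As a hedged sketch (the `samples` and `results` names are made up for illustration): `re.fullmatch` requires the whole string to match, similar to wrapping the pattern in `^` and `$`; `re.match` only anchors at the start; `re.search` looks anywhere in the string.

```python
import re

samples = ['monkey', 'I like monkeys', 'monkey business']

results = {}
for text in samples:
    results[text] = {
        # whole string must be exactly 'monkey'
        'fullmatch': re.fullmatch(r'monkey', text) is not None,
        # string must start with 'monkey'
        'match': re.match(r'monkey', text) is not None,
        # 'monkey' may appear anywhere
        'search': re.search(r'monkey', text) is not None,
    }

for text, outcome in results.items():
    print(text, outcome)
```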


#### By default, `^` and `$` anchor to the start and end of the whole string, even when it contains several lines (`$` also matches just before a trailing newline). Only with the `re.MULTILINE` flag do they match at the start and end of each line. An important distinction for multiline regex.

In [9]:
# note how we only match the first cool, at the start of the string.
# no re.MULTILINE flag is passed, so ^ does not match after the newlines.
regex = r'^cool'
multi_line = '''cool, this string has more than one line, but the
regular expression only matches this first cool.
cool at the start of this later line is not matched, because by default
^ anchors to the start of the whole string, not to the start of each line.
'''
for match in re.finditer(regex, multi_line):
    mprint(match)

cool
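
In Python's `re` module, per-line anchoring is switched on with the `re.MULTILINE` flag. A small sketch of the difference, using a made-up three-line string:

```python
import re

text = 'cool first line\ncool second line\nnot cool third line'

# default: ^ only matches at the very start of the string
default_hits = re.findall(r'^cool', text)

# re.MULTILINE: ^ also matches right after each newline
multiline_hits = re.findall(r'^cool', text, flags=re.MULTILINE)

# the same flag lets $ match at the end of every line
line_ends = re.findall(r'line$', text, flags=re.MULTILINE)

print(default_hits)
print(multiline_hits)
print(len(line_ends))
```

The third line contains "cool" but does not start with it, so it is not matched in either mode.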


# Character Sets
Along with matching specific strings and characters, regular expressions have a built-in syntax for matching sets of
different characters. The important regular expression characters for a set are `[` and `]`. Any character between the brackets
is included in the set. The hyphen, `-`, can be used to indicate a range of characters to be recognized.

In [10]:
# we want to match the characters a,b,c
regex = r'[abc]'
for match in re.finditer(regex, 'abcdefghijklmnopqrstuvwxyz'):
    mprint(match)

a
b
c


In [11]:
# here we show how to use the - character in a regex
regex = r'[a-f]'
for match in re.finditer(regex, 'abcdefghijklmnopqrstuvwxyz'):
    mprint(match)

a
b
c
d
e
f
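
Sets can hold several ranges at once, and a few related details are worth a quick look. This is a sketch under two assumptions that go slightly beyond the text above: a `^` placed first inside the brackets negates the set, and a hyphen placed last in the set is treated as a literal hyphen. The sample strings are made up.

```python
import re

text = 'Room 42B, floor 3'

# several ranges in one set: any ASCII letter or digit
letters_and_digits = re.findall(r'[A-Za-z0-9]', text)

# a leading ^ inside the brackets negates the set:
# anything that is neither a letter nor a space
not_letters = re.findall(r'[^A-Za-z ]', text)

# a hyphen at the end of the set is a literal hyphen, not a range
literal_hyphen = re.findall(r'[a-c-]', 'a-b-z')

print(letters_and_digits)
print(not_letters)
print(literal_hyphen)
```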


There are shorthands for some predefined character sets, such as digits or all word characters. A short,
non-exhaustive list:
    - \d digits
    - \w word characters (letters, digits, and the underscore)
    - \s whitespace characters
    - hexadecimal digits have no shorthand in Python's `re` module (`\h` is not supported); use the set `[0-9a-fA-F]`
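
As a quick illustration of the shorthands, here is a minimal sketch on a made-up sample string. Hexadecimal digits are matched with an explicit set, since Python's `re` has no shorthand for them.

```python
import re

text = 'a_1 b\t2!'

digits = re.findall(r'\d', text)      # decimal digits
word_chars = re.findall(r'\w', text)  # letters, digits, underscore
whitespace = re.findall(r'\s', text)  # space, tab, newline, ...

# explicit set for hexadecimal digits
hex_digits = re.findall(r'[0-9a-fA-F]', 'x1F g')

print(digits)
print(word_chars)
print(whitespace)
print(hex_digits)
```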

In [12]:
# we want to get all digits (0-9 or \d)
# a raw string keeps Python from treating \d as a string escape
regex = r'\d'
for match in re.finditer(regex, '123 is my favorite number'):
    mprint(match)

1
2
3


In [13]:
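# character set of 0-9; for this input it matches the same characters as \d
# (on str patterns, \d also matches non-ASCII Unicode digits unless re.ASCII is set)
# regex uses the - as a way to get the range
# of Unicode code points that are accepted,
# in this case, the values for 0, 1, 2 ... through 9
regex = r'[0-9]'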
for match in re.finditer(regex, 'this also has 123 in it'):
    mprint(match)

1
2
3
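
One place where `\d` and `[0-9]` differ in Python 3: on `str` patterns, `\d` matches any Unicode decimal digit, while `[0-9]` only covers the ASCII digits. A small sketch with a made-up string containing the Arabic-Indic digit seven (U+0667):

```python
import re

# '\u0667' is ARABIC-INDIC DIGIT SEVEN
text = 'ASCII 7 and Arabic-Indic \u0667'

unicode_digits = re.findall(r'\d', text)
ascii_range = re.findall(r'[0-9]', text)

# re.ASCII restricts \d to 0-9
ascii_flag = re.findall(r'\d', text, flags=re.ASCII)

print(len(unicode_digits))
print(ascii_range)
print(ascii_flag)
```

If only `0` through `9` should be accepted, for example when validating input, `[0-9]` or the `re.ASCII` flag is the safer choice.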
