<H1>Regular Expressions</H1>

<p>This page was derived from these two websites:</p>
<p>http://docs.activestate.com/komodo/4.4/regex-intro.html</p>
<p>https://docs.python.org/2/howto/regex.html</p>
<p>The second website contains more information and documentation than is included here.</p>
<p>A regular expression is often called a "regex", "rx" or "re". This primer uses the terms "regular expression" and "regex".</p>




<p>The re module was added in Python 1.5, and provides Perl-style regular expression patterns. Earlier versions of Python came with the regex module, which provided Emacs-style patterns. The regex module was removed completely in Python 2.5.</p>

<p>Regular expressions (called REs, or regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the re module. Using this little language, you specify the rules for the set of possible strings that you want to match; this set might contain English sentences, or e-mail addresses, or TeX commands, or anything you like. You can then ask questions such as “Does this string match the pattern?”, or “Is there a match for the pattern anywhere in this string?”. You can also use REs to modify a string or to split it apart in various ways.</p>

<p>Regular expression patterns are compiled into a series of bytecodes which are then executed by a matching engine written in C. For advanced use, it may be necessary to pay careful attention to how the engine will execute a given RE, and write the RE in a certain way in order to produce bytecode that runs faster. Optimization isn’t covered in this document, because it requires that you have a good understanding of the matching engine’s internals.</p>

<p>The regular expression language is relatively small and restricted, so not all possible string processing tasks can be done using regular expressions. There are also tasks that can be done with regular expressions, but the expressions turn out to be very complicated. In these cases, you may be better off writing Python code to do the processing; while Python code will be slower than an elaborate regular expression, it will also probably be more understandable.</p>

<p>Regular expressions are a syntactical shorthand for describing patterns. They are used to find text that matches a pattern, and to replace matched strings with other strings. They can be used to parse files and other input, or to provide a powerful way to search and replace.</p>

<h2>Literal Match</h2>

<p>The simplest type of regex is a literal match. Letters, numbers and most symbols in the expression will match themselves in the text being searched; an "a" matches an "a", "cat" matches "cat", "123" matches "123" and so on. For example:</p>

In [22]:
#Regular expressions are compiled into pattern objects, 
#which have methods for various operations such as searching for 
#pattern matches or performing string substitutions.

import re
p = re.compile('hello')                  # compile our regular expression to create a
                                         # pattern object.
                                         # In this case, the pattern object looks
                                         # for an exact literal match for the string hello

print "\nWhat is the object type of a compiled regular expression?"
print  type(p)

match_object = p.match('hello world')    # create a match object against the string 'hello world'
print "\nWhat is the object type of a pattern object run against a string?"
print type(match_object)                    

# Once the regular expression has been run against a string, we can check the match object
# to see what our expression matched in the searched string.
print "What text did the regular expression match?"
print match_object.group()          

print match_object.start()               # Return the starting position of the match
print match_object.end()                 # Return the ending position of the match
print match_object.span()                # Return a tuple containing the (start, end) positions of the match


What is the object type of a compiled regular expression?
<type '_sre.SRE_Pattern'>

What is the object type of a pattern object run against a string?
<type '_sre.SRE_Match'>
What text did the regular expression match?
hello
0
5
(0, 5)


<h3>No Match Found</h3>

<p>What happens if our search string doesn't contain a match for our regular expression?</p>

In [16]:
match_object = p.match('goodbye world') 
print match_object.group()          

print match_object.start()               # Return the starting position of the match
print match_object.end()                 # Return the ending position of the match
print match_object.span()                # Return a tuple containing the (start, end) positions of the match

AttributeError: 'NoneType' object has no attribute 'group'

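<p>When the regular expression doesn't match the search string, match() returns None instead of a match object, so calling a method such as group() on the result raises an AttributeError. Check for None before using the result.</p>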

In [27]:
p = re.compile('hello')                  # recompile so this cell does not depend on earlier notebook state
match_object = p.match('goodbye world') 
if match_object is None:
    print "No match found"
else:
    print match_object.group()
    
match_object = p.match('hello world') 
if match_object is None:
    print "No match found"
else:
    print match_object.group()


No match found
hello
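
<p>The None check can be wrapped in a small helper so that callers never touch a missing match object. The sketch below is only an illustration: the name first_match is hypothetical and not part of the re module, and print() is written with parentheses so the snippet runs on both Python 2 and Python 3.</p>

```python
import re

def first_match(pattern, text):
    """Return the text matched at the start of `text`, or None if there is no match."""
    m = re.match(pattern, text)
    # re.match() returns None on failure, so guard before calling group().
    return m.group() if m is not None else None

print(first_match('hello', 'hello world'))    # prints: hello
print(first_match('hello', 'goodbye world'))  # prints: None
```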


<h3>match() Method vs. search() Method</h3>

<p>The match() method finds a match only if the pattern occurs at the beginning of the string. The search() method finds a match anywhere in the string.</p>


In [28]:
import re
p = re.compile('world')                  # compile our regular expression into a pattern object

match_object = p.match('goodbye world') 
if match_object is None:
    print "No match found"
else:
    print match_object.group()

search_object = p.search('goodbye world') 
if search_object is None:
    print "No match found"
else:
    print "What is the object type of a search object?"
    print type(search_object)
    print search_object.group()
    print search_object.start()               # Return the starting position of the match
    print search_object.end()                 # Return the ending position of the match
    print search_object.span()                # Return a tuple containing the (start, end) positions of the match        

No match found
What is the object type of a search object?
<type '_sre.SRE_Match'>
world
8
13
(8, 13)
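
<p>As a further illustration of the difference, the following sketch shows that anchoring a pattern with "^" makes search() behave like match() for a single-line string. The variable names are chosen for this example only.</p>

```python
import re

text = 'goodbye world'

match_result = re.match('world', text)       # only tries position 0
search_result = re.search('world', text)     # scans the whole string
anchored_result = re.search('^world', text)  # "^" pins the search to the start

print(match_result)          # prints: None
print(search_result.span())  # prints: (8, 13)
print(anchored_result)       # prints: None
```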


<h3>Finding All Matches - findall() and finditer() methods.</h3>

<p>Two pattern methods return all of the matches for a pattern: findall() and finditer(). findall() returns a list of matching strings, while finditer() returns an iterator of match objects:</p>


In [30]:
# example findall()
import re
p = re.compile('goodbye')                  # compile our regular expression into a pattern object

match_object = p.findall('goodbye cruel world, goodbye') 
if not match_object:                       # findall() returns an empty list, never None, when nothing matches
    print "No match found"
else:
    print match_object


['goodbye', 'goodbye']


In [33]:
#example finditer()
import re
p = re.compile('goodbye')                  # compile our regular expression into a pattern object

# finditer() always returns an iterator (never None), so track whether it yielded anything.
found = False
for match in p.finditer('goodbye cruel world, goodbye'):
    found = True
    print match.group()
    print match.span()
if not found:
    print "No match found"


goodbye
(0, 7)
goodbye
(21, 28)
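
<p>Because finditer() yields match objects, it is a convenient way to collect positions as well as text. A minimal sketch, reusing the same sample string (the variable names are only for this example):</p>

```python
import re

p = re.compile('goodbye')
text = 'goodbye cruel world, goodbye'

# Collect the (start, end) position of every match.
spans = [m.span() for m in p.finditer(text)]
print(spans)        # prints: [(0, 7), (21, 28)]

# findall() on a string with no matches returns an empty list, not None.
no_matches = p.findall('hello world')
print(no_matches)   # prints: []
```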


<h2>Case in Regular Expressions</h2>

<p>Regular expressions are case-sensitive by default, but you can pass the IGNORECASE flag to re.compile() to ignore case.</p>


In [37]:

import re
p = re.compile('goodbye')   # without IGNORECASE, this will only match the lowercase goodbye.
match_object = p.findall('Goodbye cruel world, goodbye') 
print match_object

p = re.compile('goodbye', re.IGNORECASE)                  # Using IGNORECASE should match both Goodbyes.

match_object = p.findall('Goodbye cruel world, goodbye') 
print match_object


['goodbye']
['Goodbye', 'goodbye']
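
<p>The same flag can also be given by its short alias re.I, or written inside the pattern itself as the inline flag "(?i)" placed at the very start. The following sketch shows both spellings producing the same result; the variable names are illustrative.</p>

```python
import re

text = 'Goodbye cruel world, goodbye'

with_flag = re.findall('goodbye', text, re.I)    # re.I is an alias for re.IGNORECASE
inline_flag = re.findall('(?i)goodbye', text)    # inline flag at the start of the pattern

print(with_flag)     # prints: ['Goodbye', 'goodbye']
print(inline_flag)   # prints: ['Goodbye', 'goodbye']
```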


<h2>Using Wildcards in Regular Expressions, the "." Metacharacter </h2>

<p>Regex characters that perform a special function instead of matching themselves literally are called <b>"metacharacters"</b>. One such metacharacter is the dot <b> "." </b>, or wildcard. When used in a regular expression, "." matches any single character except, by default, a newline.</p>



In [38]:
import re
p = re.compile('g..dbye')   # should find all but the goooooodbye         
match_object = p.findall('goodbye cruel world, goodbye, goooooooodbye') 
print match_object


['goodbye', 'goodbye']
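
<p>One detail worth seeing in code: in Python's re module the dot does not match a newline unless the DOTALL flag is given. A small sketch with a made-up test string:</p>

```python
import re

text = 'g\n\ndbye'   # two newlines where the wildcards would have to match

default_dot = re.findall('g..dbye', text)             # "." skips newlines by default
dotall_dot = re.findall('g..dbye', text, re.DOTALL)   # DOTALL lets "." match newlines too

print(len(default_dot))  # prints: 0
print(len(dotall_dot))   # prints: 1
```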


<h2>Quantifiers in Regular Expressions</h2>
<p>Quantifiers specify how many instances of the preceding element (which can be a character or a group) must appear in order to match.</p>

<h2>Quantifiers for Literals and Wild Cards</h2> 

<h3>"?" - The Question Mark Quantifier</h3>
<p>The "?" matches 0 or 1 instances of the previous element. In other words, it makes the element optional; it can be present, but it doesn't have to be. For example:</p>


In [49]:
regex_str = 'colou?r'
p = re.compile(regex_str)   # Makes u optional.

search_str = 'color'
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()
    
    
search_str = 'colour'
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()

search_str = 'colur'
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()    

regex = colou?r MATCHES color
color
regex = colou?r MATCHES colour
colour
regex = colou?r DOESN'T matches colur


<h3>"*" - Asterisk Quantifier</h3>

<p>The "*" matches 0 or more instances of the previous element. For example:
</p>


In [54]:
regex_str = r'www\.my.*\.com'  # matches any URL of the form www.my<anything, including nothing>.com.
                               # Here the asterisk modifies the "." wildcard. Note the escape
                               # character "\" that turns "." into a literal rather than a
                               # wildcard; the raw-string prefix r keeps the backslashes intact.

p = re.compile(regex_str)   

search_str = 'www.my.com' # will match
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()
    
    
search_str = 'www.mypage.com' # will match
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()

search_str = 'www.mysite.com then text with spaces ftp.example.com' # will match
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()   
    
    
search_str = 'www.oursite.com' # won't match
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()  
    
search_str = 'mypage.com' # won't match
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()      

regex = www\.my.*\.com MATCHES www.my.com
www.my.com
regex = www\.my.*\.com MATCHES www.mypage.com
www.mypage.com
regex = www\.my.*\.com MATCHES www.mysite.com then text with spaces ftp.example.com
www.mysite.com then text with spaces ftp.example.com
regex = www\.my.*\.com DOESN'T matches www.oursite.com
regex = www\.my.*\.com DOESN'T matches mypage.com


<p>As the third match illustrates, using ".*" can be dangerous. It will match any number of any character (including spaces and non-alphanumeric characters). The quantifier is "greedy": it matches as much text as possible, so here the match runs all the way to the last ".com" in the string.</p>
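
<p>One common way to rein in a greedy quantifier is to add "?" after it, which makes it non-greedy (it then matches as little as possible). The sketch below compares the two on the third search string from above; the variable names are only for this example.</p>

```python
import re

text = 'www.mysite.com then text with spaces ftp.example.com'

greedy = re.match(r'www\.my.*\.com', text).group()   # ".*" runs to the LAST ".com"
lazy = re.match(r'www\.my.*?\.com', text).group()    # ".*?" stops at the FIRST ".com"

print(greedy)  # prints: www.mysite.com then text with spaces ftp.example.com
print(lazy)    # prints: www.mysite.com
```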

<h3>"+" - Plus Sign Quantifier</h3>

<p>The "+" matches 1 or more instances of the previous element. For example:
</p>


In [57]:
regex_str = r'bob5+@foo\.com'  # matches e-mail addresses of the form bob<one or more 5s>@foo.com

p = re.compile(regex_str)   

search_str = 'bob5@foo.com'  # will match
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()
    
    
search_str = 'bob5555@foo.com' # will match
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()

search_str = 'bob@foo.com' # won't match
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()   
    
    
search_str = 'bob65555@foo.com' # won't match
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()  
    

regex = bob5+@foo\.com MATCHES bob5@foo.com
bob5@foo.com
regex = bob5+@foo\.com MATCHES bob5555@foo.com
bob5555@foo.com
regex = bob5+@foo\.com DOESN'T matches bob@foo.com
regex = bob5+@foo\.com DOESN'T matches bob65555@foo.com


<h3> "{}" - Braces Quantifier, Matches a Specified Number of Instances of a Search Pattern </h3>

<p>To match a character a specific number of times, add that number enclosed in curly braces after the element. For example:</p>

In [58]:
regex_str = r'w{3}\.mydomain\.com'  # matches exactly three w's followed by .mydomain.com

p = re.compile(regex_str)   

search_str = 'www.mydomain.com'  # will match
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()
    
    
search_str = 'web.mydomain.com' # won't match
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()

search_str = 'w3.mydomain.com' # won't match
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()   
    

regex = w{3}\.mydomain\.com MATCHES www.mydomain.com
www.mydomain.com
regex = w{3}\.mydomain\.com DOESN'T matches web.mydomain.com
regex = w{3}\.mydomain\.com DOESN'T matches w3.mydomain.com


<h3> "{min,max}" - Using Ranges of Matches </h3>

<p>To specify the minimum number of matches to find and the maximum number of matches to allow, use a number range inside curly braces. For example:</p>

In [59]:
regex_str = '60{3,5} years'  # will match 6000, 60000 and 600000.

p = re.compile(regex_str)   

search_str = '6000 years'  # will match
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()
    
    
search_str = '60000 years' # will match
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()

search_str = '600000 years' # will match
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()    
    
search_str = '60 years' # won't match
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()  

search_str = '6000000 years' # won't match
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()  

regex = 60{3,5} years MATCHES 6000 years
6000 years
regex = 60{3,5} years MATCHES 60000 years
60000 years
regex = 60{3,5} years MATCHES 600000 years
600000 years
regex = 60{3,5} years DOESN'T matches 60 years
regex = 60{3,5} years DOESN'T matches 6000000 years


<h2> Quantifier Summary </h2>

<table>
<tr>
<th>Quantifier</th><th>Description</th>
</tr>
<tr>
<td>?</td><td>Matches the preceding element 0 or 1 times.</td>
</tr>
<tr>
<td>*</td><td>Matches the preceding element 0 or more times.</td>
</tr>
<tr>
<td>+</td><td>Matches the preceding element 1 or more times.</td>
</tr>
<tr>
<td>{num}</td><td>Matches the preceding element exactly num times.</td>
</tr>
<tr>
<td>{min,max}</td><td>Matches the preceding element at least min times, but not more than max times.</td>
</tr>
</table>
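
<p>The summary can be checked with a small table-driven sketch. The patterns and test strings are made up for illustration, and matches_fully is a hypothetical helper (not part of the re module) that appends "$" so the whole string has to match.</p>

```python
import re

# (pattern, strings that should match in full, strings that should not)
cases = [
    (r'ab?c',     ['ac', 'abc'],          ['abbc']),
    (r'ab*c',     ['ac', 'abc', 'abbbc'], ['adc']),
    (r'ab+c',     ['abc', 'abbbc'],       ['ac']),
    (r'ab{2}c',   ['abbc'],               ['abc', 'abbbc']),
    (r'ab{2,3}c', ['abbc', 'abbbc'],      ['abc', 'abbbbc']),
]

def matches_fully(pattern, text):
    """True if the pattern matches the entire text, not just a prefix."""
    return re.match('(?:' + pattern + ')$', text) is not None

for pattern, good, bad in cases:
    assert all(matches_fully(pattern, s) for s in good)
    assert not any(matches_fully(pattern, s) for s in bad)

print('all quantifier checks passed')
```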



<h2>"|" - The Alternation Operator</h2>

<p>The vertical bar "|" is used to represent an "OR" condition. Use it to separate alternate patterns or characters for matching. For example:</p>

In [62]:
regex_str = 'perl|python'  # will match either perl or python.

p = re.compile(regex_str)   

search_str = 'perl'  # will match
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()


search_str = 'python'  # will match
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()
    
search_str = 'pearl'  # will NOT match
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()    

regex = perl|python MATCHES perl
perl
regex = perl|python MATCHES python
python
regex = perl|python DOESN'T matches pearl


<h2>"()" - Grouping with Parentheses</h2>
<p>Parentheses "()" are used to group characters and expressions within larger, more complex regular expressions. Quantifiers that immediately follow the group apply to the whole group. For example:</p>

In [77]:
regex_str = '(abc){2,3}'  # match strings with at least two, but no more than three substrings of 'abc'

p = re.compile(regex_str)   

search_str = 'abc'  # won't match
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()

    
search_str = 'abcabcxxxxx'  # will match
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()

search_str = 'abcxxxxxabcxxxx'  # won't match
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()

search_str = 'abcabcabcxxxxxabcxxxx'  # will match: the string starts with three repetitions of 'abc'
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()
    
search_str = 'abcabcabcabcabc'  # will match: match() only requires a match at the start, so the first three 'abc' match and the rest is ignored
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()
        
search_str = 'xxxxabcabcabcxxxxabcxxx'  # won't match
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()       

regex = (abc){2,3} DOESN'T matches abc
regex = (abc){2,3} MATCHES abcabcxxxxx
abcabc
regex = (abc){2,3} DOESN'T matches abcxxxxxabcxxxx
regex = (abc){2,3} MATCHES abcabcabcxxxxxabcxxxx
abcabcabc
regex = (abc){2,3} MATCHES abcabcabcabcabc
abcabcabc
regex = (abc){2,3} DOESN'T matches xxxxabcabcabcxxxxabcxxx
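
<p>The last few results can look surprising: match() only requires the pattern to match at the start of the string, not to consume all of it. If the intent is "the whole string must be two or three copies of abc", one option is to end the pattern with "$". A minimal sketch (the names p_open and p_whole are just for this example):</p>

```python
import re

p_open = re.compile(r'(abc){2,3}')     # may stop before the end of the string
p_whole = re.compile(r'(abc){2,3}$')   # "$" requires the match to reach the end

five_open = p_open.match('abcabcabcabcabc')
five_whole = p_whole.match('abcabcabcabcabc')
three_whole = p_whole.match('abcabcabc')

print(five_open.group())    # prints: abcabcabc
print(five_whole)           # prints: None
print(three_whole.group())  # prints: abcabcabc
```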


<h2>Using Grouping with Alternation</h2>

<p>Groups can be used in conjunction with alternation. For example:</p>


In [80]:
regex_str = 'gr(a|e)y'  # match gray or grey

p = re.compile(regex_str)   

search_str = 'gray'  # will match
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()

    
search_str = 'grey'  # will match
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()

search_str = 'greay'  # will not match
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()    

regex = gr(a|e)y MATCHES gray
gray
regex = gr(a|e)y MATCHES grey
grey
regex = gr(a|e)y DOESN'T matches greay


<p>Strings that match these groups are stored, or "delimited", for use in substitutions or subsequent statements. The first group is stored in the metacharacter "\1", the second in "\2" and so on. For example:</p>


In [91]:
regex_str = r'(.{2,5}) (.{2,8}) <\1_\2@example\.com>'  # match "First Last <First_Last@example.com>"; \1 and \2 must repeat the two captured names

p = re.compile(regex_str)   

search_str = 'Joe Smith <Joe_Smith@example.com>'  # will match
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()

    
search_str = 'jane doe <jane_doe@example.com>'  # will match
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()

search_str = '459 33154 <459_33154@example.com>'  # will match
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()  
    
    
search_str = 'joe Smith <Joe_Smith@example.com>'  # won't match because joe won't match Joe
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()   
    
    
search_str = 'Joseph Smith <Joseph_Smith@example.com>'  # won't match because Joseph is longer than 5 characters.
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()  
    
    
search_str = 'Joses Smith <Joses_Smith@example.com>'  # will match because Joses is 5 characters.
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()   
    
search_str = 'Joe S <Joe_S@example.com>'  # doesn't match because S is less than 2 characters long.
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()      

regex = (.{2,5}) (.{2,8}) <\1_\2@example\.com> MATCHES Joe Smith <Joe_Smith@example.com>
Joe Smith <Joe_Smith@example.com>
regex = (.{2,5}) (.{2,8}) <\1_\2@example\.com> MATCHES jane doe <jane_doe@example.com>
jane doe <jane_doe@example.com>
regex = (.{2,5}) (.{2,8}) <\1_\2@example\.com> MATCHES 459 33154 <459_33154@example.com>
459 33154 <459_33154@example.com>
regex = (.{2,5}) (.{2,8}) <\1_\2@example\.com> DOESN'T matches joe Smith <Joe_Smith@example.com>
regex = (.{2,5}) (.{2,8}) <\1_\2@example\.com> DOESN'T matches Joseph Smith <Joseph_Smith@example.com>
regex = (.{2,5}) (.{2,8}) <\1_\2@example\.com> MATCHES Joses Smith <Joses_Smith@example.com>
Joses Smith <Joses_Smith@example.com>
regex = (.{2,5}) (.{2,8}) <\1_\2@example\.com> DOESN'T matches Joe S <Joe_S@example.com>
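
<p>The captured groups are also available from the match object through group(n) and groups(), and the same "\1", "\2" references can be used in the replacement string of re.sub(). The sketch below uses the pattern from above; the name-swapping substitution is an invented example.</p>

```python
import re

p = re.compile(r'(.{2,5}) (.{2,8}) <\1_\2@example\.com>')
m = p.match('Joe Smith <Joe_Smith@example.com>')

print(m.group(1))   # prints: Joe
print(m.group(2))   # prints: Smith
print(m.groups())   # prints: ('Joe', 'Smith')

# In a replacement string, \1 and \2 insert the text captured by each group.
swapped = re.sub(r'(\w+) (\w+)', r'\2, \1', 'Joe Smith')
print(swapped)      # prints: Smith, Joe
```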


<h2>"[]" - Character Class Operator</h2>

<p>Character classes indicate a set of characters to match. Enclosing a set of characters in square brackets "[...]" means "match any one of these characters". For example:</p>

In [94]:
regex_str = '[cbe]at' # match cat,bat and eat.

p = re.compile(regex_str)   

search_str = 'cat'  # will match
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()

search_str = 'eat'  # will match
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()
    
search_str = 'bat'  # will match
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()  
    
    
search_str = 'rat'  # will NOT match
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()       

regex = [cbe]at MATCHES cat
cat
regex = [cbe]at MATCHES eat
eat
regex = [cbe]at MATCHES bat
bat
regex = [cbe]at DOESN'T matches rat


<h3>Using Character Classes with Quantifiers</h3>

<p>Since a character class on its own only applies to one character in the match, combine it with a quantifier to search for multiple instances of the class. For example:</p>

In [103]:
regex_str = '[0123456789]{3}' # will match 3 character numeric strings.

p = re.compile(regex_str)   

search_str = '313'  # will match
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()

search_str = '999'  # will match
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()
    
search_str = '376abc'  # will match
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()  

    
    
search_str = 'W3C'  # will not match
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()   

search_str = '12 34 578' # findall() scans the whole string, so it finds '578'
match_object = p.findall(search_str) 
if not match_object:     # findall() returns an empty list, never None, when nothing matches
    print "No match found"
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object

search_str = '1234578' # finds '123' and '457'; matches don't overlap and the leftover '8' is skipped
match_object = p.findall(search_str) 
if not match_object:
    print "No match found"
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object


regex = [0123456789]{3} MATCHES 313
313
regex = [0123456789]{3} MATCHES 999
999
regex = [0123456789]{3} MATCHES 376abc
376
regex = [0123456789]{3} DOESN'T matches W3C
regex = [0123456789]{3} MATCHES 12 34 578
['578']
regex = [0123456789]{3} MATCHES 1234578
['123', '457']


<h3> "[a-zA-Z]" - Using Ranges of Characters</h3>
<p>If we were to try the same thing with letters, we would have to enter all 26 letters in upper and lower case. Fortunately, we can specify a range instead using a hyphen. For example:</p>

In [106]:
regex_str = '[a-zA-Z]{4}' # will match strings that start with 4 letters.

p = re.compile(regex_str)   

search_str = 'Paul'  # will match
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()

search_str = 'Paul Olsztyn'  # will match
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()

search_str = 'Tad Olsztyn'  # will not match: the fourth character is a space
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()

search_str = 'Pau7 Olsztyn'  # will not match: the fourth character is a digit
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()


regex = [a-zA-Z]{4} MATCHES Paul
Paul
regex = [a-zA-Z]{4} MATCHES Paul Olsztyn
Paul
regex = [a-zA-Z]{4} DOESN'T matches Tad Olsztyn
regex = [a-zA-Z]{4} DOESN'T matches Pau7 Olsztyn
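
<p>A common slip when writing letter ranges is to type "A-z" (lowercase z) instead of "A-Z". Ranges follow character-code order, and in ASCII the characters [ \ ] ^ _ ` sit between "Z" and "a", so "A-z" quietly accepts them as well. A short sketch with a made-up test string:</p>

```python
import re

sample = 'a_Z^['

loose = re.findall('[A-z]', sample)      # range runs from 'A' (65) to 'z' (122)
strict = re.findall('[a-zA-Z]', sample)  # letters only

print(loose)   # prints: ['a', '_', 'Z', '^', '[']
print(strict)  # prints: ['a', 'Z']
```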


<h3>"\w \d" Special Patterns for Most Commonly used Character Classes</h3>
<p>Most languages have special patterns for representing the most commonly used character classes. For example, Python uses "\d" to represent any digit (same as "[0-9]" for ASCII text) and "\w" to represent any "word" character: a letter, digit, or underscore (same as "[a-zA-Z0-9_]" for ASCII text). See your language documentation for the special sequences applicable to the language you use.</p>

In [109]:
regex_str = r'[\d]{3}-[\d]{2}-[\d]{4}' # will match strings formatted like U.S. Social Security numbers (checks the format only)

p = re.compile(regex_str)   

search_str = '123-45-6789'  # will match
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()

search_str = 'abc-45-6789'  # won't match
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()

search_str = '123 45 6789'  # won't match
match_object = p.match(search_str) 
if match_object is None:
    print 'regex = ' + regex_str + " DOESN'T matches " + search_str
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object.group()
    
    


regex = [\d]{3}-[\d]{2}-[\d]{4} MATCHES 123-45-6789
123-45-6789
regex = [\d]{3}-[\d]{2}-[\d]{4} DOESN'T matches abc-45-6789
regex = [\d]{3}-[\d]{2}-[\d]{4} DOESN'T matches 123 45 6789


In [121]:
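regex_str = r'[\w]* ' # finds each (possibly empty) run of word characters that is followed by a space; the final word has no trailing space, so it is missed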

p = re.compile(regex_str)   

search_str = '12 For score and seven years ago may the 4orce be with you' # will match
match_object = p.findall(search_str) 
if match_object is None:
    print "No match found"
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object


regex = [\w]*  MATCHES 12 For score and seven years ago may the 4orce be with you
['12 ', 'For ', 'score ', 'and ', 'seven ', 'years ', 'ago ', 'may ', 'the ', '4orce ', 'be ', 'with ']
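
<p>The output above drops the final word because the pattern requires a trailing space. A minimal sketch of one alternative, assuming the goal is simply to list every word: use "\w+" (one or more word characters) with no space in the pattern. The variable names here are illustrative, and the code uses print() so that it runs on Python 3.</p>

```python
import re

text = '12 For score and seven years ago may the 4orce be with you'

# \w+ matches one or more word characters and needs no trailing space,
# so the last word in the string is found as well.
words = re.findall(r'\w+', text)
print(words)
```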


<h3>"^" - Negated Character Classes</h3>

<p>To define a group of characters you do not want to match, use a negated character class. Adding a caret "^" to the beginning of the character class (i.e. [^...]) means "match any character except these". For example:</p>


In [142]:
regex_str = '[^a-zA-Z]{4}' # will find runs of 4 characters, none of which is a letter

p = re.compile(regex_str)   

search_str = '1234' # will match
match_object = p.findall(search_str) 
if match_object is None:
    print "No match found"
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object

search_str = '$.25' # will match
match_object = p.findall(search_str) 
if match_object is None:
    print "No match found"
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object

search_str = '#77;' # will match
match_object = p.findall(search_str) 
if match_object is None:
    print "No match found"
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object

search_str = 'Perl' # won't match
match_object = p.findall(search_str) 
if match_object is None or len(match_object) == 0 :
    print "No match found"
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object   
    
search_str = 'AT&T' # won't match
match_object = p.findall(search_str) 
if match_object is None or len(match_object) == 0 :
    print "No match found"
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object    

regex = [^a-zA-Z]{4} MATCHES 1234
['1234']
regex = [^a-zA-Z]{4} MATCHES $.25
['$.25']
regex = [^a-zA-Z]{4} MATCHES #77;
['#77;']
No match found
No match found
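
<p>A negated class matches any character that is not listed, which includes spaces and punctuation as well as digits. The short sketch below, using a made-up input string, shows this; it also shows that findall() returns an empty list (never None) when nothing matches.</p>

```python
import re

non_letters = re.compile(r'[^a-zA-Z]{4}')

# The space before the digits is not a letter, so it is part of the match.
mixed = non_letters.findall('AT&T 1234')
print(mixed)

# No run of four non-letters exists here, so the result is an empty list.
nothing = non_letters.findall('Perl')
print(nothing)
```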


<h2>Predefined Character Sequences</h2>

<table>
<tr><th>Special Character</th><th>Represents</th></tr>

<tr>
<td>\d</td><td>Matches any decimal digit; this is equivalent to the class [0-9].</td>
</tr>

<tr>
<td>\D</td><td>Matches any non-digit character; this is equivalent to the class [^0-9]</td>
</tr>

<tr>
<td>\s</td><td>Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].</td>
</tr>

<tr>
<td>\S</td><td>Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].</td>
</tr>

<tr>
<td>\w</td><td>Matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].</td>
</tr>

<tr>
<td>\W</td><td>Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].</td>
</tr>
</table>

In [133]:

# Matches any non-digit character; this is equivalent to the class [^0-9].
regex_str = r'[\D]' 

p = re.compile(regex_str)   

search_str = '12 For score and seven years ago, may the 4orce be with you;' # will match
match_object = p.findall(search_str) 
if match_object is None:
    print "No match found"
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object

# Matches any decimal digit; this is equivalent to the class [0-9].
regex_str = r'[\d]' 
p = re.compile(regex_str)   

search_str = '12 For score and seven years ago, may the 4orce be with you;' # will match
match_object = p.findall(search_str) 
if match_object is None:
    print "No match found"
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object

# Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].    
regex_str = r'[\s]' 
p = re.compile(regex_str)   

search_str = '12 For score and seven years ago, may the 4orce be with you;' # will match
match_object = p.findall(search_str) 
if match_object is None:
    print "No match found"
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object

# Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].
regex_str = r'[\S]' 
p = re.compile(regex_str)   

search_str = '12 For score and seven years ago, may the 4orce be with you;' # will match
match_object = p.findall(search_str) 
if match_object is None:
    print "No match found"
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object

# Matches any alphanumeric character or underscore; this is equivalent to the class [a-zA-Z0-9_].
regex_str = r'[\w]' 
p = re.compile(regex_str)   
    
search_str = '12 For score and seven years ago, may the 4orce be with you;' # will match
match_object = p.findall(search_str) 
if match_object is None:
    print "No match found"
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object

# Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].
regex_str = r'[\W]' 
p = re.compile(regex_str)   
    
search_str = '12 For score and seven years ago, may the 4orce be with you;' 
match_object = p.findall(search_str) 
if match_object is None:
    print "No match found"
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object    
    

regex = [\D] MATCHES 12 For score and seven years ago, may the 4orce be with you;
[' ', 'F', 'o', 'r', ' ', 's', 'c', 'o', 'r', 'e', ' ', 'a', 'n', 'd', ' ', 's', 'e', 'v', 'e', 'n', ' ', 'y', 'e', 'a', 'r', 's', ' ', 'a', 'g', 'o', ',', ' ', 'm', 'a', 'y', ' ', 't', 'h', 'e', ' ', 'o', 'r', 'c', 'e', ' ', 'b', 'e', ' ', 'w', 'i', 't', 'h', ' ', 'y', 'o', 'u', ';']
regex = [\d] MATCHES 12 For score and seven years ago, may the 4orce be with you;
['1', '2', '4']
regex = [\s] MATCHES 12 For score and seven years ago, may the 4orce be with you;
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']
regex = [\S] MATCHES 12 For score and seven years ago, may the 4orce be with you;
['1', '2', 'F', 'o', 'r', 's', 'c', 'o', 'r', 'e', 'a', 'n', 'd', 's', 'e', 'v', 'e', 'n', 'y', 'e', 'a', 'r', 's', 'a', 'g', 'o', ',', 'm', 'a', 'y', 't', 'h', 'e', '4', 'o', 'r', 'c', 'e', 'b', 'e', 'w', 'i', 't', 'h', 'y', 'o', 'u', ';']
regex = [\w] MATCHES 12 For score and seven years ago, may the 4orce
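
<p>The six cells above can also be written as one loop. This is a compact sketch rather than the original notebook code; the dictionary name results is an illustrative choice. It prints only how many characters each class matched, and the full lists stay available in the dictionary. For this ASCII string, each class and its negation together account for every character.</p>

```python
import re

text = '12 For score and seven years ago, may the 4orce be with you;'

# Map each predefined class to the list of characters it matches.
results = {}
for pattern in (r'\d', r'\D', r'\s', r'\S', r'\w', r'\W'):
    results[pattern] = re.findall(pattern, text)
    print('%s matched %d characters' % (pattern, len(results[pattern])))
```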

<h2>"$ ^" - Anchors: Matching at Specific Locations</h2>
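<p>Anchors specify where in a string or line to look for a match. The "^" metacharacter (when not used at the beginning of a negated character class) matches at the beginning of the string or line, and the "$" metacharacter matches at the end of the string or line. In Python, "^" and "$" match at the start and end of each line only when the MULTILINE flag is set; by default they anchor to the string as a whole ("$" also matches just before a trailing newline).</p>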
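
<p>The difference between anchoring to the whole string and anchoring to each line can be sketched as follows. The three-line sample text is invented for illustration; re.MULTILINE is the standard flag from the re module.</p>

```python
import re

text = 'From: alice\nTo: bob\nFrom: carol'

# Default: ^ matches only at the very start of the string.
default_hits = re.findall(r'^From: \w+', text)

# MULTILINE: ^ also matches immediately after each newline.
multiline_hits = re.findall(r'^From: \w+', text, re.MULTILINE)

# The same applies to $ at the end of the string versus the end of each line.
default_ends = re.findall(r'\w+$', text)
multiline_ends = re.findall(r'\w+$', text, re.MULTILINE)

print(default_hits)
print(multiline_hits)
print(default_ends)
print(multiline_ends)
```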

In [149]:

# will find strings that start with 'From: root@server'.
# Note: the trailing \.* matches zero or more literal dots, not the rest of the line.
regex_str = r'^From: root@server\.*'

p = re.compile(regex_str)   

search_str = 'From: root@server.example.com' # will match
match_object = p.findall(search_str) 
if match_object is None or len(match_object)== 0:
    print "No match found"
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object

search_str = 'I got this From: root@server.example.com yesterday' # won't match
match_object = p.findall(search_str) 
if match_object is None or len(match_object)== 0:
    print "No match found"
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object

    
regex_str = r'.*\/index.php$'  # will match strings that end in /index.php (the unescaped "." matches any character)

p = re.compile(regex_str)   

search_str = 'www.example.org/index.php' # will match
match_object = p.findall(search_str) 
if match_object is None or len(match_object)== 0:
    print "No match found"
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object

search_str = 'the file is /tmp/index.php' # will match
match_object = p.findall(search_str) 
if match_object is None or len(match_object)== 0:
    print "No match found"
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object

search_str = 'www.example.org/index.php?id=245' # will NOT match
match_object = p.findall(search_str) 
if match_object is None or len(match_object)== 0:
    print "No match found"
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object
    
    
    
regex_str = r'^To: .*example.org$'  # will match strings that start with 'To: ' and end with 'example.org'

p = re.compile(regex_str)   

search_str = 'To: hr@example.net, qa@example.org' # will match
match_object = p.findall(search_str) 
if match_object is None or len(match_object)== 0:
    print "No match found"
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object

search_str = 'To: qa@example.org, hr@example.net' # won't match: the string does not end with 'example.org'
match_object = p.findall(search_str) 
if match_object is None or len(match_object)== 0:
    print "No match found"
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object
    

regex = ^From: root@server\.* MATCHES From: root@server.example.com
['From: root@server.']
No match found
regex = .*\/index.php$ MATCHES www.example.org/index.php
['www.example.org/index.php']
regex = .*\/index.php$ MATCHES the file is /tmp/index.php
['the file is /tmp/index.php']
No match found
regex = ^To: .*example.org$ MATCHES To: hr@example.net, qa@example.org
['To: hr@example.net, qa@example.org']
No match found
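
<p>In the patterns above, the "." in "index.php" and "example.org" is unescaped, so it matches any character, not only a literal dot. A minimal sketch of the difference, assuming a literal dot is what was intended; the input 'www.example.org/indexXphp' is a made-up string chosen to expose the gap.</p>

```python
import re

loose = re.compile(r'.*/index.php$')    # "." matches any character
strict = re.compile(r'.*/index\.php$')  # "\." matches only a literal dot

odd = 'www.example.org/indexXphp'
normal = 'www.example.org/index.php'

loose_hit = loose.search(odd) is not None
strict_hit = strict.search(odd) is not None
print(loose_hit)
print(strict_hit)

# Both patterns accept the ordinary file name.
both_ok = loose.search(normal) is not None and strict.search(normal) is not None
print(both_ok)
```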


<h1>Modifying Strings</h1>

<p>Up to this point, we’ve simply performed searches against a static string. Regular expressions are also commonly used to modify strings in various ways, using the following pattern methods:</p>

<table>
<tr><th>Method/Attribute</th><th>Purpose</th></tr>
<tr>
<td>split()</td>
<td>Split the string into a list, splitting it wherever the RE matches</td>
</tr>
<tr>
<td>sub()</td><td>Find all substrings where the RE matches, and replace them with a different string</td>
</tr>
<tr>
<td>subn()</td><td>Does the same thing as sub(), but returns the new string and the number of replacements</td>
</tr>
</table>

<h2>split() Method</h2>
<p>
The split() method of a pattern splits a string apart wherever the RE matches, returning a list of the pieces. It’s similar to the split() method of strings but is much more general in the delimiters it can split on; the string method only supports splitting by whitespace or by a fixed string. As you’d expect, there’s a module-level re.split() function, too.
</p>

<p>
split(string[, maxsplit=0])</p>
<p>
Split string by the matches of the regular expression. If capturing parentheses are used in the RE, then their contents will also be returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits are performed.
</p>

<p>
You can limit the number of splits by passing a value for maxsplit; the remainder of the string is then returned as the final element of the list. In the following example, no limit is set and the delimiter is the literal string "---".</p>


In [152]:
# splits the string at each occurrence of '---'
regex_str = r'---'

p = re.compile(regex_str)   

search_str = 'Paul---Olsztyn---is---great' # will match
match_object = p.split(search_str) 
if match_object is None or len(match_object)== 0:
    print "No match found"
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object


regex = --- MATCHES Paul---Olsztyn---is---great
['Paul', 'Olsztyn', 'is', 'great']
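
<p>The maxsplit argument and the effect of capturing parentheses, both described above, can be sketched with the same input string. The pattern "-+" (one or more hyphens) is an illustrative choice, not taken from the original example.</p>

```python
import re

text = 'Paul---Olsztyn---is---great'

# At most one split: the rest of the string is the final list element.
limited = re.compile(r'-+').split(text, maxsplit=1)
print(limited)

# Capturing parentheses make split() return the delimiters as well.
with_delims = re.compile(r'(-+)').split(text)
print(with_delims)
```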


<h2>sub() Method</h2>

<p>Another common task is to find all the matches for a pattern, and replace them with a different string. The sub() method takes a replacement value, which can be either a string or a function, and the string to be processed.</p>


<p>.sub(replacement, string[, count=0])</p>
<p>
Returns the string obtained by replacing the leftmost non-overlapping occurrences of the RE in string with replacement. If the pattern isn’t found, string is returned unchanged.
</p>
<p>
The optional argument count is the maximum number of pattern occurrences to be replaced; count must be a non-negative integer. The default value of 0 means to replace all occurrences.
</p>


In [153]:
# replaces each occurrence of '---' with '!!!'
regex_str = r'---'

p = re.compile(regex_str)   

search_str = 'Paul---Olsztyn---is---great' # will match
match_object = p.sub('!!!',search_str) 
if match_object is None or len(match_object)== 0:
    print "No match found"
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object


regex = --- MATCHES Paul---Olsztyn---is---great
Paul!!!Olsztyn!!!is!!!great
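
<p>The count argument, a function used as the replacement, and subn() are described above but not shown. A short sketch of all three, reusing the same input string; the upper-casing lambda is only an illustration of a replacement function, which receives a match object and returns the replacement text.</p>

```python
import re

text = 'Paul---Olsztyn---is---great'
p = re.compile(r'---')

# count=1 replaces only the first (leftmost) occurrence.
first_only = p.sub('!!!', text, count=1)
print(first_only)

# subn() returns a tuple: (new string, number of replacements made).
replaced, how_many = p.subn('!!!', text)
print(replaced)
print(how_many)

# A function as the replacement is called once per match.
shouted = re.compile(r'[A-Za-z]+').sub(lambda m: m.group().upper(), text)
print(shouted)
```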


<h2>Compilation Flags</h2>

<p>Compilation flags let you modify some aspects of how regular expressions work. Flags are available in the re module under two names, a long name such as IGNORECASE and a short, one-letter form such as I. (If you’re familiar with Perl’s pattern modifiers, the one-letter forms use the same letters; the short form of re.VERBOSE is re.X, for example.) Multiple flags can be specified by bitwise OR-ing them; re.I | re.M sets both the I and M flags, for example.
</p>
<p>
Here’s a table of the available flags, followed by a more detailed explanation of each one.</p>

<table>
<tr><th>Flag</th><th>Meaning</th></tr>

<tr>
<td>DOTALL, S</td><td>Make . match any character, including newlines</td>
</tr>

<tr>
<td>IGNORECASE, I</td><td>Do case-insensitive matches</td>
</tr>

<tr>
<td>LOCALE, L</td><td>Do a locale-aware match</td>
</tr>

<tr>
<td>MULTILINE, M</td><td>Multi-line matching, affecting ^ and $</td>
</tr>

<tr>
<td>VERBOSE, X</td><td>Enable verbose REs, which can be organized more cleanly and understandably.</td>
</tr>

<tr>
<td>UNICODE, U</td><td>Makes several escapes like \w, \b, \s and \d dependent on the Unicode character database.</td>
</tr>
</table>

In [157]:
regex_str = r'PaUl'  # 

p = re.compile(regex_str) # won't match below because case doesn't match.  

search_str = 'Paul---Olsztyn---is---great' # won't match here: the pattern is case-sensitive by default
match_object = p.findall(search_str) 
if match_object is None or len(match_object)== 0:
    print "No match found"
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object


p = re.compile(regex_str,re.I)   # 

search_str = 'Paul---Olsztyn---is---great' # will match because we specified IGNORECASE in regular
                                            # expression modifier
match_object = p.findall(search_str) 
if match_object is None or len(match_object)== 0:
    print "No match found"
else:
    print 'regex = ' + regex_str + " MATCHES " + search_str
    print match_object


No match found
regex = PaUl MATCHES Paul---Olsztyn---is---great
['Paul']
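
<p>Several flags can be combined with the bitwise OR operator, as described above. The sketch below combines VERBOSE, IGNORECASE and MULTILINE and adds a separate DOTALL check; the sample text and the pattern are invented for illustration.</p>

```python
import re

# VERBOSE lets the pattern span several lines with comments;
# whitespace inside the pattern is ignored.
pattern = re.compile(r"""
    ^paul      # first name at the start of a line
    \s+        # one or more whitespace characters
    olsztyn$   # surname at the end of a line
""", re.VERBOSE | re.IGNORECASE | re.MULTILINE)

text = 'PAUL olsztyn\nTad Olsztyn\npaul   OLSZTYN'
matches = pattern.findall(text)
print(matches)

# DOTALL: by default "." does not match a newline; with the flag it does.
without_dotall = re.search(r'a.b', 'a\nb') is not None
with_dotall = re.search(r'a.b', 'a\nb', re.DOTALL) is not None
print(without_dotall)
print(with_dotall)
```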
