# Regular Expression Review

## How to construct and debug regular expressions

The next cell defines a large HTML string that we will use to test some regular expressions.  When writing a program that depends on accurately extracting instances of certain patterns from text or HTML, create the regular expressions first and test them on realistic example strings.  You need your expressions to do two things:

1. Match the strings you are trying to extract, and possibly some context around them, to guarantee you
   are extracting the right information;
2. If your expression matches context as well as the information you are trying to extract
   (and often it will have to), identify the target part of the expression.  This is done by placing the target part of 
   the pattern in parentheses (illustrated below).
   
The homework assignment asks you to extract the year from the baby-names HTML file.  The line containing the relevant information looks like this:
     
     <h3 align="center">Popularity in 1990</h3>
     
One regular expression that will match the year is the following:

       '\d\d\d\d'

The code below tries out this idea.  Evaluate it and report on the  success of the idea in the markdown cell below the code cell.  

In [1]:
import re

html_string = """
<head><title>Popular Baby Names</title>
<meta name="dc.language" scheme="ISO639-2" content="eng">
<meta name="dc.creator" content="OACT">
<meta name="lead_content_manager" content="JeffK">
<meta name="coder" content="JeffK">
<meta name="dc.date.reviewed" scheme="ISO8601" content="2005-12-30">
<link rel="stylesheet" href="../OACT/templatefiles/master.css" type="text/css" media="screen">
<link rel="stylesheet" href="../OACT/templatefiles/custom.css" type="text/css" media="screen">
<link rel="stylesheet" href="../OACT/templatefiles/print.css" type="text/css" media="print">
</head>
<body bgcolor="#ffffff" text="#000000" topmargin="1" leftmargin="0">
<table width="100%" border="0" cellspacing="0" cellpadding="4">
  <tbody>
  <tr><td class="sstop" valign="bottom" align="left" width="25%">
      Social Security Online
    </td><td valign="bottom" class="titletext">
      <!-- sitetitle -->Popular Baby Names
    </td>
  </tr>
  <tr bgcolor="#333366"><td colspan="2" height="2"></td></tr>
  <tr><td class="graystars" width="25%" valign="top">
       <a href="../OACT/babynames/">Popular Baby Names</a></td><td valign="top"> 
      <a href="http://www.ssa.gov/"><img src="/templateimages/tinylogo.gif"
      width="52" height="47" align="left"
      alt="SSA logo: link to Social Security home page" border="0"></a><a name="content"></a>
      <h1>Popular Names by Birth Year</h1>September 12, 2007</td>
  </tr>
  <tr bgcolor="#333366"><td colspan="2" height="1"></td></tr>
</tbody></table>
<table width="100%" border="0" cellspacing="0" cellpadding="4" summary="formatting">
  <tr valign="top"><td width="25%" class="greycell">
      <a href="../OACT/babynames/background.html">Background information</a>
      <p><br />
      &nbsp; Select another <label for="yob">year of birth</label>?<br />      
      <form method="post" action="/cgi-bin/popularnames.cgi">
      &nbsp; <input type="text" name="year" id="yob" size="4" value="1990">
      <input type="hidden" name="top" value="1000">
      <input type="hidden" name="number" value="">
      &nbsp; <input type="submit" value="   Go  "></form>
    </td><td>
<h3 align="center">Popularity in 1990</h3>
<p align="center">
"""
re1 = r'\d\d\d\d'
re1_revised = r'[12]\d\d\d'
match = re.search(re1,html_string)
match_two = re.search(re1_revised,html_string)
# match object tells you positions in string where match begins and ends (match.start() and match.end()).  
# Let's look at  this span

#match = None
#match_two = None
if match:
   print html_string[match.start():match.end()]
if match_two:
   print html_string[match_two.start():match_two.end()]
   print match_two.group()

8601
2005
2005


Discuss how well this regular expression worked at extracting the year. If it failed, explain why.
You may edit this cell.

    The first expression, re1, matched the first run of four digits in html_string and returned '8601', which comes from the scheme name 'ISO8601' rather than from a year. The second expression, re1_revised, matched the first four-digit sequence starting with '1' or '2' and returned '2005', which is part of the date in the 'dc.date.reviewed' metadata. To pick out the right digits, the pattern must include the text that precedes the year, for example r'Popularity in ([1-3][0-9]{3})'.

This exercise should have convinced you that you need to amend the regular expression to provide some context; 4 digits in a row, even if the first is required to be 1 or 2, won't do it.  In the next cell, define and test a new regular expression that does
the job. You may want to try some of the exercises in the following sections first, to get some practice with regular expressions.

In [2]:
# Simplified expression:
# re.search(r"Popularity in ([1-3][0-9]{3})", html_string).group(1)

rel_improved = r'Popularity in ([1-3][0-9]{3})'
match_three = re.search(rel_improved, html_string)
if match_three:
    print html_string[match_three.start():match_three.end()]
    print match_three.group(1)

Popularity in 1990
1990


For the next html string, you want to find ALL the triples of the form RANK, MALE NAME, FEMALE NAME.
Your output should look like this:

   [('1', 'Jacob', 'Emma'), ('2', 'Michael', 'Isabella'), ('3', 'Ethan', 'Emily')]
   
You can get this using `re.findall`.  The next cell gives you a pretty helpful example of how to use it.

In [3]:
import re
html_str2 = """<tr align="center" valign="bottom">
  <th scope="col" width="12%" bgcolor="#efefef">Rank</th>
  <th scope="col" width="41%" bgcolor="#99ccff">Male name</th>
<th scope="col" bgcolor="pink" width="41%">Female name</th></tr>
<tr align="right"><td>1</td><td>Jacob</td><td>Emma</td>
</tr>
<tr align="right"><td>2</td><td>Michael</td><td>Isabella</td>
</tr>
<tr align="right"><td>3</td><td>Ethan</td><td>Emily</td>
</tr>"""
res1 = re.findall(r'<tr\s+.+><td>\d+</td>',html_str2)
res2 = re.findall(r'<tr\s+.+><td>(\d+)</td>',html_str2)

(res1, res2)
                

(['<tr align="right"><td>1</td>',
  '<tr align="right"><td>2</td>',
  '<tr align="right"><td>3</td>'],
 ['1', '2', '3'])

Notice the very different results you get with very similar `findall` requests.  The function `findall` is written so as to retrieve the **groups** in your regular expression. The groups in your regular expression are defined by parentheses.  If there are no groups (no parentheses), `findall` returns a list of complete matches.  So the first result above is what you get for a regular expression with no groups, and the second is what you get for a regular expression with one group.  If your regular expression contains multiple groups, you get a list of tuples.  Each tuple member corresponds to one group in the pattern.  Since you're being asked for a result that is a list of triples, you want a regular expression with 3 groups.

In [4]:
res3 = re.findall(r'<tr\s+.+><td>(\d+)</td><td>(\w+)</td><td>(\w+)</td>',html_str2)
print res3

[('1', 'Jacob', 'Emma'), ('2', 'Michael', 'Isabella'), ('3', 'Ethan', 'Emily')]


## Solving crosswords (requires NLTK)

The following example is adapted from [the NLTK Book, Ch. 3.](http://www.nltk.org/book/ch03.html)

Let's say we're in the midst of doing a crossword puzzle and we need an 8-letter word
whose third letter is *j* and whose sixth letter is *t* and which means
*sad*.    We want
words that match the following regular expression pattern:

   '^..j..t..$'

Notice that this specifies a string of exactly 8 characters because
of the `^` and the `$`, which mark the beginning
and ending of the string, respectively.  Each `.` is a wildcard
which matches exactly one character, and that character can be anything except a newline.
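To see what the anchors contribute without loading the full NLTK word list, here is a minimal sketch on a small hand-picked list. The name `sample_words` and its contents are made up for illustration; the pattern is the one given above.

```python
import re

# A hypothetical mini word list, standing in for nltk.corpus.words
sample_words = ['dejected', 'majestic', 'objection', 'reject', 'unjustly']

# With ^ and $ the whole word must be exactly 8 characters long.
anchored = [w for w in sample_words if re.search(r'^..j..t..$', w)]

# Without the anchors, any word containing such a stretch qualifies.
unanchored = [w for w in sample_words if re.search(r'..j..t..', w)]

print(anchored)
print(unanchored)
```

The 9-letter word `objection` appears only in the second list, because without `^` and `$` the pattern is free to match just part of a word.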


In [5]:
import re
from nltk.corpus import words
wds = words.words()
len(wds)  # 235786
cands = [w for w in wds if re.search('^..j..t..$',w)]
cands


[u'abjectly',
 u'adjuster',
 u'dejected',
 u'dejectly',
 u'injector',
 u'majestic',
 u'objectee',
 u'objector',
 u'rejecter',
 u'rejector',
 u'unjilted',
 u'unjolted',
 u'unjustly']

And now we check our list and there it is: *dejected*.
Will you ever be stumped by a crossword puzzle again?

## Textonyms

The [NLTK Book, Ch. 3](http://www.nltk.org/book/ch03.html) introduces the
concept of **textonym** with this definition:

   The T9 system is used for entering text on mobile phones: Two or more words that are 
   entered with the same sequence of keystrokes are known as textonyms. For example, 
   both *hole* and *golf* are entered by pressing the sequence `4653`. What other words could 
   be produced with the same sequence? 

   Here we  could use the regular expression `'^[ghi][mno][jlk][def]$'`.  

    >>> [w for w in wds if re.search('^[ghi][mno][jlk][def]$', w)]
    ['gold', 'golf', 'hold', 'hole']

Try the following.  Find all words that can be spelled out with the sequence
`3456`.

In [6]:
[w for w in wds if re.search('^[def][ghi][jkl][mno]$', w)]

[u'dilo', u'film', u'filo']

In [7]:
[w for w in wds if re.search('^[ghi][mno][jlk][def]$', w)]

[u'gold', u'golf', u'hold', u'hole', u'gold', u'hole']
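Rather than writing each character-class pattern by hand, the pattern can be generated from the key sequence. The following is a sketch only: `T9_KEYS`, `t9_pattern`, `textonyms` and the `sample` list are hypothetical names introduced here, and the keypad layout is assumed to be the standard one. Passing the matches through `set` also removes the repeated entries seen in the output above.

```python
import re

# Letters on a standard phone keypad (assumed layout, digits 2-9)
T9_KEYS = {
    '2': 'abc', '3': 'def', '4': 'ghi', '5': 'jkl',
    '6': 'mno', '7': 'pqrs', '8': 'tuv', '9': 'wxyz',
}

def t9_pattern(digits):
    """Build an anchored regex matching every word typed with these keys."""
    return '^' + ''.join('[%s]' % T9_KEYS[d] for d in digits) + '$'

def textonyms(digits, wordlist):
    """Return the distinct words in wordlist spelled by the key sequence."""
    regex = re.compile(t9_pattern(digits))
    return sorted(set(w for w in wordlist if regex.search(w)))

# A tiny hypothetical word list standing in for nltk.corpus.words
sample = ['gold', 'golf', 'hold', 'hole', 'gold', 'film', 'holes']

print(t9_pattern('4653'))
print(textonyms('4653', sample))
```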

## Regular expression practice

In [8]:
import re
pat = r'a|b|c'
pat2 = r'[abc]'
re.match(pat2,'b')

<_sre.SRE_Match at 0x11154ebf8>

Edit this cell and after each regular expression, describe the class of strings it matches.  Check your answer by examining the output of the code cell that follows.

1.  [a-zA-Z]+
    A series of one or more uppercase and/or lowercase letters.
    
2.  [A-Z][a-z]+
    An uppercase letter followed by one or more lowercase letters.

3.  \d+(\.\d+)?
    One or more digits, optionally followed by a decimal point and one or more further digits.  The decimal point is included only when at least one digit follows it.

4.  ([bcdfghjklmnpqrstvwxyz][aeiou][bcdfghjklmnpqrstvwxyz])*
    Zero or more repetitions of a lowercase consonant followed by a lowercase vowel followed by a lowercase consonant.  Because zero repetitions are allowed, it also matches the empty string.


5.  \w+|[^\w\s]+ 
    Either a series of one or more word characters (letters, digits, or underscore) OR a series of one or more characters that are neither word characters nor whitespace.


In [9]:
########################################
###     Some regular expressions     ###
########################################

re2 = r'[a-zA-Z]+'
re3 = r'[A-Z][a-z]+'
re4 = r'\d+(\.\d+)?'
re5 = r'([bcdfghjklmnpqrstvwxyz][aeiou][bcdfghjklmnpqrstvwxyz])*'
re6 = r'\w+|[^\w\s]+'
res = [re2,re3,re4,re5,re6]

########################################
###     Some example strings         ###
########################################

example1 = 'abracadabra'
example2 = '1billygoat'
example3 = 'billygoat1'
example4 = '43.1789'
example5 = '43.'
example6 = '43'
example7 = 'road_runner'
example8 = ' road_runner'
example9 = 'bathos'
example10 = "The little dog laughed to see such a sight."
example11 = 'socrates'
example12 = 'Socrates'
example13 = '*&%#!?'
example14 = 'IBM'

examples = [example1,example2,example3,example4,example5,example6,
            example7,example8,example9,example10,example11,example12,example13,
            example14]

########################################
###     Trying some matches          ###
########################################

for i,re_pat in enumerate(res):
    banner = 're%d %s' % (i+2,re_pat)
    print 
    print banner
    print '=' * len(banner)
    print
    for (i,ex) in enumerate(examples):
        match = re.match(re_pat,ex)
        if match:
            print '  %2d. %-45s  %s' % (i+1,ex,ex[match.start():match.end()])
        else:
            print '  %2d. %-45s  %s' %(i+1,ex,None)
            


re2 [a-zA-Z]+

   1. abracadabra                                    abracadabra
   2. 1billygoat                                     None
   3. billygoat1                                     billygoat
   4. 43.1789                                        None
   5. 43.                                            None
   6. 43                                             None
   7. road_runner                                    road
   8.  road_runner                                   None
   9. bathos                                         bathos
  10. The little dog laughed to see such a sight.    The
  11. socrates                                       socrates
  12. Socrates                                       Socrates
  13. *&%#!?                                         None
  14. IBM                                            IBM

re3 [A-Z][a-z]+

   1. abracadabra                                    None
   2. 1billygoat                                     None
   3. billygoat1  

Make sure you can answer the following questions about the results of testing these regular expressions on the examples:

1. Why does `re2` fail on `example8`?
    `re.match` only looks for a match beginning at the first character of the string, and `example8` begins with a space, which is not a letter.

2. Why does `re3` only succeed on `example10` and `example12`?  Be sure to explain why it fails on `example14`.
    The match must begin with an uppercase letter followed by one or more lowercase letters, and only `example10` and `example12` start that way. `example14` fails because its first uppercase letter is followed by another uppercase letter instead of a lowercase one.
   
3. When `re4` matches `example5`, why isn't the decimal point part of the match?
    The group `(\.\d+)` requires at least one digit after the decimal point, and the `?` makes the whole group optional (zero or one occurrence). In `43.` no digit follows the point, so the group matches zero times and the match is just the digits before the point.

4. All of the regular expressions except `re5` report a `None` with at least one of the examples.  Why doesn't `re5` report any `None`s?
    The `*` after the parenthesized group allows zero repetitions, so `re5` can match the empty string at the start of any string. When no consonant-vowel-consonant sequence is found there, the result is an empty match rather than `None`.

5. Why does `re6` match all the characters in `example13`?
    `re6` matches either a series of word characters OR a series of characters that are neither word characters nor whitespace. Every character in `example13` is of the second kind, so the second alternative consumes the whole string.

6. Why doesn't `re6` match `example8`?
    The first character of `example8` is a space, which neither alternative of `re6` can match, and `re.match` does not look past the start of the string.
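Two of these answers can be checked directly. The sketch below reuses `re2` and `re5` as defined in the code cell above and contrasts `re.match`, which only tries position 0, with `re.search`, which scans forward. The variable names `at_start`, `anywhere` and `empty` are introduced here for illustration.

```python
import re

re2 = r'[a-zA-Z]+'
re5 = r'([bcdfghjklmnpqrstvwxyz][aeiou][bcdfghjklmnpqrstvwxyz])*'

# re.match anchors at position 0; re.search scans forward for a match.
at_start = re.match(re2, ' road_runner')
anywhere = re.search(re2, ' road_runner')
print(at_start)          # None: the leading space blocks the match
print(anywhere.group())  # the first run of letters found further along

# re5 allows zero repetitions, so it succeeds with an empty match.
empty = re.match(re5, '43.1789')
print(repr(empty.group()))
```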


## An example that requires NLTK to be installed

   To run the code for this example, you will use a **balanced corpus**
   of English texts, a corpus collected with the purpose of representing
   a balanced variety of English text types: fiction, poetry, speech,
   nonfiction, and so on.  One relatively well-established, free,
   and easy-to-get example of such a corpus is the **Brown Corpus.**
   Brown is about 1.2 M words. 
   
   You can import the corpus as follows:

     >>> from nltk.corpus import brown


   If this does not work, it is probably because you have nltk installed without the accompanying
   corpora. You can download any nltk corpus you need through the `nltk.download` function.  For example,
   to get the Brown corpus, do the following in Python:
      
      >>> import nltk
      >>> nltk.download()

   This brings up a window you can interact with.  There are some tabs
   at the top.  Choose the tab labeled *Corpora*,
   select **Brown**, and click the **download** button
   at the bottom of the window.   You will then have
   Brown on your machine and you can import the corpus as follows:

     >>> from nltk.corpus import brown

   The following returns a list of all 1.2 M word tokens in Brown:

     >>> brown.words()
     ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]



In [10]:
# From http://www.nltk.org/book/ch03.html
#  Find the most common vowel sequences in English.  Note: be patient.  Evaluating this may take a while.
from nltk.corpus import brown
from collections import Counter
bw = sorted(set(brown.words()))
# Find every run of two or more consecutive lowercase vowels in each distinct word form.
# bw is a sorted set, so a word form is counted once however often it occurs in Brown.
ctr = Counter(vs  for word in bw for vs in re.findall('[aeiou]{2,}',word)
              )
ctr.most_common(25)



[(u'io', 2787),
 (u'ea', 2249),
 (u'ou', 1855),
 (u'ie', 1799),
 (u'ia', 1400),
 (u'ee', 1289),
 (u'oo', 1174),
 (u'ai', 1145),
 (u'ue', 541),
 (u'au', 540),
 (u'ua', 502),
 (u'ei', 485),
 (u'ui', 483),
 (u'oa', 466),
 (u'oi', 412),
 (u'eo', 250),
 (u'iou', 225),
 (u'eu', 187),
 (u'oe', 181),
 (u'iu', 128),
 (u'ae', 85),
 (u'eau', 54),
 (u'uo', 53),
 (u'eou', 52),
 (u'uou', 37)]
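Because `bw` is built with `set`, the counts above are over distinct word forms, not over running text. The difference is easy to see on a toy example. This is a sketch with a made-up token list (`tokens`) and a hypothetical helper (`vowel_runs`); it does not touch the Brown corpus.

```python
import re
from collections import Counter

# A hypothetical token list standing in for brown.words()
tokens = ['the', 'queen', 'saw', 'the', 'queen', 'leaving', 'quietly']

def vowel_runs(words):
    """Count runs of two or more lowercase vowels across the given words."""
    return Counter(vs for w in words for vs in re.findall(r'[aeiou]{2,}', w))

by_type = vowel_runs(sorted(set(tokens)))   # each distinct word counted once
by_token = vowel_runs(tokens)               # repeated words counted each time

print(by_type['uee'])
print(by_token['uee'])
```

Counting over `brown.words()` directly, without `set`, would give token frequencies instead, which weight common words more heavily.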

## Poker examples

Suppose you are writing a poker program where a player’s hand is represented as a 5-character string with each character representing a card, “a” for ace, “k” for king, “q” for queen, “j” for jack, “t” for 10, and “2” through “9” representing the card with that value.

To see if a given string is a valid hand, one could run the code in the following cell:

In [11]:
import re
def displaymatch(regex,text):
    match = regex.match(text)
    if match is None:
        matchstring = None
    else:
        matchstring = '%s[%s]%s' % (text[:match.start()],text[match.start():match.end()],text[match.end():])
    print '%-10s %s' % (text,matchstring)

valid = re.compile(r"^[a2-9tjqk]{5}$")

## Some examples
displaymatch(valid, "akt5q")  # Valid.
displaymatch(valid, "akt5e")  # Invalid.
displaymatch(valid, "akt")    # Invalid.
displaymatch(valid, "727ak")  # Valid.
displaymatch(valid, "727aka")  # Invalid.


akt5q      [akt5q]
akt5e      None
akt        None
727ak      [727ak]
727aka     None


The hand "727ak" contains a pair, and we would like to recognize such hands as special, so that we can go all in.  We can do this using regular expression groups and backreferences.  The match for each parenthesized part of a regular expression is called a **group**.  We can refer back to the text matched by a group with `\N`, where `N` is the group's number: `\1` refers to the first group, `\2` to the second, and so on (Python accepts numbers from 1 through 99).  So to match poker hands with pairs, we do the following.

In [12]:
pair = re.compile(r".*(.).*\1.*")
displaymatch(pair,"727ak")
displaymatch(pair,"723ak")
pair.match("717ak").groups()[0]

727ak      [727ak]
723ak      None


'7'

In [13]:
displaymatch(pair,"a2aak")
pair.match("aa2ak").groups()[0]

a2aak      [a2aak]


'a'

In [14]:
two_pair = re.compile(r"(?:.*(.).*(.).*\1.*\2.*)|(?:.*(.).*(.).*\4.*\3.*)|(?:.*(.).*\5.*(.).*\6.*)")

#two_pair = re.compile(r".*(.).*(.).*\2.*\1.*")
displaymatch(two_pair,"a2kak")
displaymatch(two_pair,"a2kka")
displaymatch(two_pair,"aa2kk")
displaymatch(two_pair,"aa2aq")
displaymatch(two_pair,"aaaaq")
four_of_a_kind = re.compile(r".*(.).*\1.*\1.*\1.*")
four_of_a_kind_alt = re.compile(r".?(.).?\1.?\1.?\1.?")
displaymatch(four_of_a_kind,"akaaa")
displaymatch(four_of_a_kind,"kaaaa")
displaymatch(four_of_a_kind,"aakaa")
displaymatch(four_of_a_kind,"aaaka")
displaymatch(four_of_a_kind,"aaaak")

a2kak      [a2kak]
a2kka      [a2kka]
aa2kk      [aa2kk]
aa2aq      None
aaaaq      [aaaaq]
akaaa      [akaaa]
kaaaa      [kaaaa]
aakaa      [aakaa]
aaaka      [aaaka]
aaaak      [aaaak]


Of course, the regex `pair` does not require the text string to be a poker hand.  We could revise it to do that, but if you think about it a little, it would make the regex **a lot** more complicated.  What we can do instead is first apply `valid` to guarantee we've got a valid poker hand and then apply `pair` to find out if it contains a pair. This keeps both regexes simple and easy to understand while still enforcing all the constraints we want.  Often a good strategy in applying regexes to enforce complicated constraints is to divide the constraints up into separate categories and apply them **in succession**.
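Applying the two regexes in succession might look like the following sketch. It reuses the `valid` regex from above and a `pair` regex without the trailing `.*`; the function name `paired_rank` is hypothetical, chosen here for illustration.

```python
import re

valid = re.compile(r"^[a2-9tjqk]{5}$")
pair = re.compile(r".*(.).*\1")

def paired_rank(hand):
    """Return the rank of a repeated card, or None if the string is not
    a valid hand or contains no repeated card."""
    if not valid.match(hand):      # first constraint: a legal 5-card hand
        return None
    m = pair.match(hand)           # second constraint: some card repeats
    return m.group(1) if m else None

print(paired_rank("727ak"))   # valid hand with a repeated 7
print(paired_rank("723ak"))   # valid hand, no repeat
print(paired_rank("xx234"))   # repeats 'x', but not a valid hand
```

The last call shows why the order matters: `pair` alone would happily report a pair of `x`.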

A problem with `pair` is that it doesn't tell us what we've got a pair of.  Actually, the match object contains this information.  It has a method called `groups` which returns all portions of the string that matched a group.  We can use a revised version of `displaymatch` to print this, when requested:

In [15]:
import re
def displaymatch(regex,text, print_groups=False):
    match = regex.match(text)
    if match is None:
        matchstring = None
    else:
        matchstring = '%s[%s]%s' % (text[:match.start()],text[match.start():match.end()],text[match.end():])
    if print_groups and match:
        print '%-10s %s %s' % (text,matchstring,match.groups())
    else:
        print '%-10s %s' % (text,matchstring)

# Re for recognizing pair hands
pair = re.compile(r".*(.).*\1")
displaymatch(pair,"723ak",print_groups=True)

## Write your regex for recognizing two pair below. Test
two_pair = re.compile(r"(?:.*(.).*(.).*\1.*\2.*)|(?:.*(.).*(.).*\4.*\3.*)|(?:.*(.).*\5.*(.).*\6.*)")

displaymatch(two_pair,"722a7",print_groups=True)
displaymatch(two_pair,"722ak",print_groups=True)  # Should fail on this one
displaymatch(two_pair,"7a722",print_groups=True)
displaymatch(two_pair,"727a2",print_groups=True)
displaymatch(two_pair,"aaaa2",print_groups=True)  # Will succeed on this one, but that's ok

723ak      None
722a7      [722a7] (None, None, '7', '2', None, None)
722ak      None
7a722      [7a722] (None, None, None, None, '7', '2')
727a2      [727a2] ('7', '2', None, None, None, None)
aaaa2      [aaaa2] ('a', 'a', None, None, None, None)


## Questions

1.  Write regexes that match three-of-a-kind hands and four-of-a-kind hands.  Follow the model of `pair` and don't bother to guarantee that it's a valid poker hand.

    three_of_a_kind = re.compile(r".*(.).*\1.*\1.*")
    four_of_a_kind = re.compile(r".*(.).*\1.*\1.*\1.*")
    
    
2.  It's quite complex to write a regular expression that checks to see if you've got a straight, but you can try the following strategy.  First, verify you've got a valid poker hand; then verify you haven't got a pair, three-of-a-kind, or four-of-a-kind.  So you have a valid poker hand with no repetitions and you don't need the regex that checks for straights to rule those out.
    Now write a regex that will check to see if a valid poker hand with no repetitions is a straight beginning with '2'.  It should succeed on `23456` and `25643` and `32654` and it should fail on `24357`.  To deal with all possible straights in this way, how many cases are there to take care of?  Write a single regular expression that will identify any straight, given that it is a valid poker hand with no repetitions.  Test it on the straights above and on straights like `akqjt` and on the non-straight `24357`.
    
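    # Ten cases (a2345 up to tjqka); with repeats ruled out, five cards drawn from five consecutive ranks form a straight in any order.
    straight = re.compile(r"^(?:[a2345]{5}|[23456]{5}|[34567]{5}|[45678]{5}|[56789]{5}|[6789t]{5}|[789tj]{5}|[89tjq]{5}|[9tjqk]{5}|[tjqka]{5})$")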
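One way to handle the cards in any order is to use one character class per run of five consecutive ranks, and the ten classes can be generated rather than typed out. The sketch below is one possible approach, not the only one. `RANKS` and `runs` are names introduced here, and the regex is only meaningful for hands already known to be valid and free of repeats (it would accept `22222`, for instance).

```python
import re

RANKS = "a23456789tjqka"   # ace appears at both ends: low and high

# One character class per run of five consecutive ranks (10 runs in all).
runs = [RANKS[i:i + 5] for i in range(len(RANKS) - 4)]
straight = re.compile("^(?:" + "|".join("[%s]{5}" % r for r in runs) + ")$")

for hand in ["23456", "25643", "32654", "akqjt", "5a342", "24357"]:
    print("%s %s" % (hand, bool(straight.match(hand))))
```

Only `24357` is rejected: no single run of five consecutive ranks contains both `2` and `7`.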
    
    
3.  Write a regex that matches a two pair hand. This is tricky and the most natural answer will also match four-of-a-kind. Assume we've eliminated that possibility by failing to match the four-of-kind pattern from 1.  You should test `722a7`, `7a722` and `727a2`.  You will need a pattern that is a big disjunction using `|`, and you will need to enclose the disjuncts of this big disjunction in parentheses, but for that purpose you will need parentheses that don't count as defining a retrievable group.  The notation for that is `(?:` instead of `(` [the same right paren is used in both cases]. See [Python regex docs.](http://docs.python.org/2/library/re.html)

    two_diff_pair = re.compile(r"(?:.*(.).*(.).*\1.*\2.*)|(?:.*(.).*(.).*\4.*\3.*)|(?:.*(.).*\5.*(.).*\6.*)")

## How to do extraction

The following example is from [the Weather Underground page for San Diego](http://www.wunderground.com/weather-forecast/US/CA/San_Diego.html).  The temperature is regularly given in a page division (HTML tag `div`) whose `id` attribute is `nowTemp`.  If we can find that division and the temperature inside it, we have what we want.  The pattern needs to match across multiple lines, because the context that identifies the temperature does not occur on the same line as the temperature, so it is compiled with `re.DOTALL`, which lets `.` match newline characters (`re.MULTILINE` is also passed, but it only changes the behavior of `^` and `$`, which this pattern does not use).  Compiling a regular expression also makes it more efficient to reuse.  A key point is that we place the actual temperature we want inside parentheses, the `(\d{1,3}\.\d)` part of the pattern.  The strings matched by the parenthesized portions of a pattern are available from the match object's `groups()` method, which returns a tuple with one entry per group.

In [16]:
import re
html_string = """
<div class="br10" id="stationSelect">
		<a class="br10" id="stationselector_button" href="javascript:void(0);" onclick="_gaq.push(['_trackEvent', 'Station Select', 'Opened']);"><span>Station Select</span></a>
		</div>
		</div>
		<div id="conds_dashboard">
		<div id="hour00">
		<div id="nowCond">
		<div class="titleSubtle">Now</div>
		<div id="curIcon"><a href="" class="iconSwitchBig"><img src="http://icons-ak.wxug.com/i/c/k/nt_partlycloudy.gif" width="44" height="44" alt="Scattered Clouds" class="condIcon" /></a></div>
		<div id="curCond">Scattered Clouds</div>
		</div>
		<div id="nowTemp">
		<div class="titleSubtle">Temperature</div>
		<div id="tempActual"><span id="rapidtemp" class="pwsrt" pwsid="KCASANDI123" pwsunit="english" pwsvariable="tempf" english="&deg;F" metric="&deg;C" value="55.8">
  <span class="nobr"><span class="b">55.8</span>&nbsp;&deg;F</span>
</span></div>
		<div id="tempFeel">Feels Like
  <span class="nobr"><span class="b">55.1</span>&nbsp;&deg;F</span>
</div>
		</div>
"""
pattern = r'<div\s+id\s*=\s*\"nowTemp\"\s*>.*?(\d{1,3}\.\d).*?</div>'
pattern_re = re.compile(pattern,re.MULTILINE | re.DOTALL)
#m = re.search(pattern_re,html_string)
#m.groups()
pattern_re.findall(html_string)

['55.8']

The pattern in the example above was built up piece by piece.  First we built a regular expression matching the `<div id="nowTemp">` part of the pattern.  That piece looked like this:
    
     subpattern = r'<div\s+id\s*=\s*\"nowTemp\"\s*>'
 
The `\s*` aren't needed for this particular string, but there is considerable variation in how actual HTML is generated, and since
white space in the `\s*` positions wouldn't be meaningful, it is allowed.  Next we tested the core part of the pattern on its own:
 
     corepattern = r'(\d{1,3}\.\d)'
  
Finally we tested the last part:
  
     lastpattern = r'</div>'
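The piece-by-piece approach can be sketched as follows. The `snippet` string is a shortened, made-up stand-in for the page source, and the variable `full` is introduced here; the three pieces are the ones named above.

```python
import re

# A shortened, hypothetical stand-in for the page source above
snippet = '<div id="nowTemp">\n<div class="titleSubtle">Temperature</div>\n<span class="b">55.8</span></div>'

subpattern = r'<div\s+id\s*=\s*"nowTemp"\s*>'
corepattern = r'(\d{1,3}\.\d)'
lastpattern = r'</div>'

# Test each piece on its own before gluing them together.
for piece in (subpattern, corepattern, lastpattern):
    print(bool(re.search(piece, snippet)))

# Join the pieces with non-greedy fillers; DOTALL lets '.' cross line breaks.
full = re.compile(subpattern + r'.*?' + corepattern + r'.*?' + lastpattern, re.DOTALL)
print(full.findall(snippet))
```

Testing the pieces separately makes it much easier to locate the culprit when the combined pattern returns nothing. Leaving out `re.DOTALL`, for example, yields an empty list here, because the filler `.*?` cannot get past the first line break.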

## Tokenization  (NLTK assumed)

Tokenization is the process of breaking up a text into words.  We have in some cases used `split()` for this purpose, uniformly splitting a text up into words on the spaces, but this doesn't always yield the right results, as the next examples show.

In [31]:
import sre_parse
def compile_regexp_to_noncapturing(pattern, flags=0):
    """
    Compile the regexp pattern after switching all grouping parentheses
    in the given regexp pattern to non-capturing groups.

    :type pattern: str
    :rtype: str
    """
    def convert_regexp_to_noncapturing_parsed(parsed_pattern):
        res_data = []
        for key, value in parsed_pattern.data:
            if key == sre_constants.SUBPATTERN:
                index, subpattern = value
                value = (None, convert_regexp_to_noncapturing_parsed(subpattern))
            elif key == sre_constants.GROUPREF:
                raise ValueError('Regular expressions with back-references are not supported: {0}'.format(pattern))
            res_data.append((key, value))
        parsed_pattern.data = res_data
        print(repr(parsed_pattern))
        print(repr(parsed_pattern.pattern))
        print(repr(parsed_pattern.pattern.groups))
        parsed_pattern.pattern.groups = 1
        parsed_pattern.pattern.groupdict = {}
        return parsed_pattern

    print(repr(pattern))
    print(repr(sre_parse.parse(pattern)))
    return sre_compile.compile(convert_regexp_to_noncapturing_parsed(sre_parse.parse(pattern)), flags=flags)

In [32]:
class Pattern:
    # master pattern object.  keeps track of global attributes
    def __init__(self):
        self.flags = 0
        self.open = []
        self.groups = 1
        self.groupdict = {}
    def opengroup(self, name=None):
        gid = self.groups
        self.groups = gid + 1
        if name is not None:
            ogid = self.groupdict.get(name, None)
            if ogid is not None:
                raise error("redefinition of group name %s as group %d; "
                            "was group %d" % (repr(name), gid,  ogid))
            self.groupdict[name] = gid
        self.open.append(gid)
        return gid
    def closegroup(self, gid):
        self.open.remove(gid)
    def checkgroup(self, gid):
        return gid < self.groups and gid not in self.open

In [42]:
# From http://www.nltk.org/book/ch03.html
import sre
import sre_parse
import sre_compile
import sre_constants
from __future__ import division
import nltk, re, pprint
from nltk import word_tokenize
#from nltk.tokenize.regexp import compile_regexp_to_noncapturing

text = """
"That," said  Fred, "is what you get in the U.S.A. for $5.29."
"""
try1 = text.split()

pattern = r""" 
   ([A-Z]\.)+        # abbreviations, e.g. U.S.A.
  |\w+(-\w+)*        # words with optional internal hyphens
  |\$?\d+(\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
  |\.\.\.            # ellipsis
  |[][.,;"'?():-_`]  # these are separate tokens
"""
npattern_re = compile_regexp_to_noncapturing(pattern)
npattern_re = re.compile(npattern,re.UNICODE | re.MULTILINE | re.DOTALL |re.X)
try2 = npattern_re.findall(text)
import nltk
try3 = nltk.regexp_tokenize(text,npattern_re)

' \n   ([A-Z]\\.)+        # abbreviations, e.g. U.S.A.\n  |\\w+(-\\w+)*        # words with optional internal hyphens\n  |\\$?\\d+(\\.\\d+)?%?  # currency and percentages, e.g. $12.40, 82%\n  |\\.\\.\\.            # ellipsis\n  |[][.,;"\'?():-_`]  # these are separate tokens\n'
[('branch', (None, [[('literal', 32), ('literal', 10), ('literal', 32), ('literal', 32), ('literal', 32), ('max_repeat', (1, 4294967295L, [('subpattern', (1, [('in', [('range', (65, 90))]), ('literal', 46)]))])), ('literal', 32), ('literal', 32), ('literal', 32), ('literal', 32), ('literal', 32), ('literal', 32), ('literal', 32), ('literal', 32), ('literal', 35), ('literal', 32), ('literal', 97), ('literal', 98), ('literal', 98), ('literal', 114), ('literal', 101), ('literal', 118), ('literal', 105), ('literal', 97), ('literal', 116), ('literal', 105), ('literal', 111), ('literal', 110), ('literal', 115), ('literal', 44), ('literal', 32), ('literal', 101), ('any', None), ('literal', 103), ('any', None), ('litera

NameError: name 'npattern' is not defined

In [37]:
pattern = r"""([A-Za-z])+"""
m = re.match(pattern,'Hi')
m.group()

'Hi'

In [38]:
try1

['"That,"',
 'said',
 'Fred,',
 '"is',
 'what',
 'you',
 'get',
 'in',
 'the',
 'U.S.A.',
 'for',
 '$5.29."']

The `split` tokenized sentence has some very strange words, for example the 7-character string `"That,"`, the 5-character string `Fred,`, and the 3-character string `"is`. What's being missed here is that certain characters (like comma and quotation mark) unambiguously mark a word boundary.  Regular expressions are very good at enforcing this sort of generalization, as we can see by comparing the results of tokenizing the same sentence with a regexp that does not allow words to continue past boundary markers.

In [39]:
try2

NameError: name 'try2' is not defined

Python regular expressions use parentheses for two different things: defining retrievable groups, which, as we saw, is useful for extraction, and defining the scope of some regular expression operator (like `*` or `+`). Sometimes these two roles get in each other's way.  This is what happens in `pattern` above: the parentheses are there only for scoping, but `findall` treats every parenthesized element as a group and returns the groups instead of the complete matches; so we use the regular expression convention of changing `(` to `(?:`.  The `(?:` functions unambiguously to scope an operator and does not define a retrievable group.  Rather than make this change by hand, the code above calls `compile_regexp_to_noncapturing`, a copy of the NLTK helper that the commented-out import refers to.  The regular expression is then compiled with several flags.  `re.DOTALL` lets `.` match newlines and `re.MULTILINE` lets `^` and `$` match at line boundaries (the tokenizing `pattern` uses none of these, so neither flag changes its behavior), while `re.UNICODE` allows our definition of word, which depends on the interpretation of `\w`, to apply to Unicode characters.  Finally, `re.X` is the most directly relevant to this example.  It allows whitespace and comments to be interspersed in a regular expression, which makes it much more readable.  See [Python.org re docs](http://docs.python.org/2/library/re.html) for more details.
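For comparison, here is a minimal sketch that makes the `(` to `(?:` change by hand and uses only the standard `re` module, with no NLTK helper. The names `tokenizer`, `tokens` and `captured` are introduced here, and the sample sentence is the one used above without its surrounding newlines.

```python
import re

text = '"That," said  Fred, "is what you get in the U.S.A. for $5.29."'

# Same alternatives as above, with every grouping parenthesis written as (?:
tokenizer = re.compile(r"""
    (?:[A-Z]\.)+          # abbreviations, e.g. U.S.A.
  | \w+(?:-\w+)*          # words with optional internal hyphens
  | \$?\d+(?:\.\d+)?%?    # currency and percentages, e.g. $12.40, 82%
  | \.\.\.                # ellipsis
  | [][.,;"'?():-_`]      # these are separate tokens
""", re.UNICODE | re.X)

tokens = tokenizer.findall(text)
print(tokens)

# With a capturing group, findall returns the group contents, not the tokens.
captured = re.findall(r'\w+(-\w+)*', 'a well-known fact')
print(captured)
```

The second call shows the problem that motivates the change: with a capturing group, `findall` reports only what the group matched (an empty string when the group did not take part), so the words themselves are lost.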


## Sentence boundary detection

In [None]:
import re
text = """
The king rarely saw Marie 
on Tuesdays, but
he did see her  on Wednesdays.  He liked
to take long walks
in the garden, gazing at the
rhododendrns longingly.  She
thought this
odd.  Me, too.
"""
lines = re.split(r'\s*[!?.]\s*', text)


In [None]:
lines