### Practicing regular expressions

Python has a powerful library for regular expressions, the re module, that handles these searching / extracting tasks elegantly. The syntax is a little different from the rest of Python, so regular expressions are like their own mini programming language.

- re.search()
- re.findall() 
- ^ matches the beginning of the line
- $ matches the end of the line 
- \s matches a whitespace character
- \S matches a non-whitespace character (opposite of \s)
- * applies to the immediately preceding character and indicates to match zero or more times (greedy)
- *? applies to the immediately preceding character and indicates to match zero or more times in "non-greedy" mode
- + applies to the immediately preceding character and indicates to match one or more times (greedy)
- +? applies to the immediately preceding character and indicates to match one or more times in "non-greedy" mode
- ? applies to the immediately preceding character and indicates to match zero or one time
- [aeiou] matches a single character as long as that character is in the specified set, so [aeiou] matches "a", "e", "i", etc. but no other characters
- [a-z0-9] you can specify ranges of characters using the hyphen
- [^A-Za-z] when the first character is ^ it inverts the logic; this example matches a single character that is anything other than an uppercase or lowercase letter
- ( ) when parentheses are added to a regular expression, they are not matched literally, but they allow you to extract a particular subset of the matched string rather than the whole string when using findall()
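
The rules in the list above can be tried out on small strings before touching a real file. The sketch below is only an illustration: the sample strings and variable names are made up for this example and do not come from mbox-short.txt.

```python
import re

text = "<b>bold</b> and <i>italic</i>"  # made-up sample string

# Greedy: .+ grabs as much as it can, so the match runs to the LAST '>'.
greedy = re.findall(r"<.+>", text)

# Non-greedy: .+? stops at the FIRST '>' that lets the pattern succeed.
lazy = re.findall(r"<.+?>", text)

# A character set and an inverted character set.
vowels = re.findall(r"[aeiou]", "regex")
non_letters = re.findall(r"[^A-Za-z]", "a1 b2")

# Parentheses: findall() returns only the part inside the ( ).
user = re.findall(r"(\S+)@\S+", "mail csev@umich.edu now")

print(greedy)       # one long match covering the whole string
print(lazy)         # four short matches, one per tag
print(vowels)
print(non_letters)
print(user)
```

Comparing the greedy and lazy results side by side is usually the quickest way to see what the extra ? does.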


In [8]:
import re 

hand = open('mbox-short.txt') 
for line in hand: 
    line = line.rstrip()
    if re.search('From:', line): 
        print(line) 

# could have used line.find() to accomplish the same result 
# the power comes when we need to use special characters like the caret ^
# The ^ character is used for regular expressions to match the beginning
# of a line... so it's much more specific. 

hand = open('mbox-short.txt')
for line in hand: 
    line = line.rstrip() 
    if re.search('^From:', line): 
        print(line) 

# Still, you could have done this with the startswith() method
# So this is a contrived / simple example

From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: gsilver@umich.edu
From: gsilver@umich.edu
From: zqian@umich.edu
From: gsilver@umich.edu
From: wagnermr@iupui.edu
From: zqian@umich.edu
From: antranig@caret.cam.ac.uk
From: gopal.ramasammycook@gmail.com
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: louis@media.berkeley.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: zqian@umich.edu
From: cwen@iupui.edu
From: zqian@umich.edu
From: zqian@umich.edu
From: zqian@umich.edu
From: mmmay@indiana.edu
From: cwen@iupui.edu
From: zqian@umich.edu
From: zqian@umich.edu
From: zqian@umich.edu
From: cwen@iupui.ed
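
To see what the caret changes, here is a minimal sketch that runs both patterns, plus startswith(), on a few in-memory lines. The lines are hypothetical stand-ins for the file; only the patterns come from the cell above.

```python
import re

# Hypothetical sample lines, not taken from mbox-short.txt.
lines = [
    "From: stephen.marquard@uct.ac.za",
    "Subject: reply to From: field question",
    "From: louis@media.berkeley.edu",
]

# Without the caret, 'From:' may appear anywhere in the line.
anywhere = [ln for ln in lines if re.search(r"From:", ln)]

# With the caret, 'From:' must be at the very beginning of the line.
at_start = [ln for ln in lines if re.search(r"^From:", ln)]

# For a fixed prefix like this, startswith() selects the same lines.
with_startswith = [ln for ln in lines if ln.startswith("From:")]

print(len(anywhere), len(at_start))
print(at_start == with_startswith)
```

In the real file both versions may happen to print the same lines; the difference only shows up when "From:" occurs in the middle of a line.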

In [9]:
# The dot matches any single character, so 'F..m:' matches any of
# the strings "From:", "Fxxm:", "F12m:" etc.
# The leading ^ means the match must be at the beginning of the line

import re
hand = open('mbox-short.txt') 
for line in hand: 
    line = line.rstrip()
    if re.search('^F..m:', line): 
        print(line) 

From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: gsilver@umich.edu
From: gsilver@umich.edu
From: zqian@umich.edu
From: gsilver@umich.edu
From: wagnermr@iupui.edu
From: zqian@umich.edu
From: antranig@caret.cam.ac.uk
From: gopal.ramasammycook@gmail.com
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: louis@media.berkeley.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: zqian@umich.edu
From: cwen@iupui.edu
From: zqian@umich.edu
From: zqian@umich.edu
From: zqian@umich.edu
From: mmmay@indiana.edu
From: cwen@iupui.edu
From: zqian@umich.edu
From: zqian@umich.edu
From: zqian@umich.edu
From: cwen@iupui.ed

In [10]:
# using the .+ wildcard
# Now the search string '^From:.+@' will match lines that start with
# "From:", followed by one or more characters (.+), followed by an @-sign.

import re
hand = open('mbox-short.txt')
for line in hand: 
    line = line.rstrip()
    if re.search('^From:.+@', line): 
        print(line) 

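# this matches lines like ...
# From: jason_yum@email.com
# The .+ is "greedy" / "pushy": it expands as far as it can, so if a line
# contained several @-signs the match would run up to the last one.
# We still see the text to the right of the @ because print(line) prints
# the whole line, not just the matched part.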

From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: gsilver@umich.edu
From: gsilver@umich.edu
From: zqian@umich.edu
From: gsilver@umich.edu
From: wagnermr@iupui.edu
From: zqian@umich.edu
From: antranig@caret.cam.ac.uk
From: gopal.ramasammycook@gmail.com
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: louis@media.berkeley.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: zqian@umich.edu
From: cwen@iupui.edu
From: zqian@umich.edu
From: zqian@umich.edu
From: zqian@umich.edu
From: mmmay@indiana.edu
From: cwen@iupui.edu
From: zqian@umich.edu
From: zqian@umich.edu
From: zqian@umich.edu
From: cwen@iupui.ed
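
Because print(line) shows the whole line, the greedy behaviour is easier to see by printing only the matched text. A small sketch, assuming a made-up line with two @-signs (group() on the match object returns just the part that matched):

```python
import re

# Hypothetical line with two @-signs, made up for this illustration.
line = "From: stephen.marquard@uct.ac.za cc: louis@media.berkeley.edu"

greedy = re.search(r"^From:.+@", line)   # .+ expands as far as possible
lazy = re.search(r"^From:.+?@", line)    # .+? stops as early as possible

print(greedy.group())   # runs up to the last @ in the line
print(lazy.group())     # stops at the first @
```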

In [11]:
# Review: 
# .rstrip() removes trailing whitespace (including the newline)
# re is the regular expression library
# re.search() takes two arguments
# first = the pattern you are looking for
# second = the string you want to search
# if it finds a match (the result is truthy), then print(line) 

import re
hand = open('mbox-short.txt') 
for line in hand: 
    line = line.rstrip() 
    if re.search('^From:', line): 
        print(line) 

From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: gsilver@umich.edu
From: gsilver@umich.edu
From: zqian@umich.edu
From: gsilver@umich.edu
From: wagnermr@iupui.edu
From: zqian@umich.edu
From: antranig@caret.cam.ac.uk
From: gopal.ramasammycook@gmail.com
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: louis@media.berkeley.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: zqian@umich.edu
From: cwen@iupui.edu
From: zqian@umich.edu
From: zqian@umich.edu
From: zqian@umich.edu
From: mmmay@indiana.edu
From: cwen@iupui.edu
From: zqian@umich.edu
From: zqian@umich.edu
From: zqian@umich.edu
From: cwen@iupui.ed

In [12]:
import re
s = 'A message from skim@harvard.edu to jyum@brown.edu about meeting @2pm'
lst = re.findall(r'\S+@\S+', s)  # raw string so \S reaches the re module unchanged
print(lst)

# The findall() method searches the string in the second argument... 
# and returns a list of all of the strings that look like email addresses
# \S => non-whitespace character 

['skim@harvard.edu', 'jyum@brown.edu']


In [13]:
# notice @2pm is not matched (nothing comes before the @)
# notice that @com and ai@ are not matched either: \S+ needs at least
# one non-whitespace character on each side of the @
# but b@i is matched because one non-whitespace character on each side is enough! 

s = 'skim@harvard.edu to jyum@brown.edu about meeting @2pm but not from @com or ai@ but we taken b@i'
lst = re.findall(r'\S+@\S+', s) 
print(lst)

['skim@harvard.edu', 'jyum@brown.edu', 'b@i']


In [35]:
hand = open('mbox-short.txt')
for line in hand: 
    line = line.rstrip()
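    x = re.findall(r'\S+@\S+', line) 
    if len(x) > 0: 
        print(x) 
        
# Why do we need another variable x? 
# ANS: we use the result twice (once to test it, once to print it),
# so storing it avoids calling findall() twice and keeps the code clear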

['stephen.marquard@uct.ac.za']
['<postmaster@collab.sakaiproject.org>']
['<200801051412.m05ECIaH010327@nakamura.uits.iupui.edu>']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['apache@localhost)']
['source@collab.sakaiproject.org;']
['stephen.marquard@uct.ac.za']
['source@collab.sakaiproject.org']
['stephen.marquard@uct.ac.za']
['stephen.marquard@uct.ac.za']
['louis@media.berkeley.edu']
['<postmaster@collab.sakaiproject.org>']
['<200801042308.m04N8v6O008125@nakamura.uits.iupui.edu>']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['apache@localhost)']
['source@collab.sakaiproject.org;']
['louis@media.berkeley.edu']
['source@collab.sakaiproject.org']
['louis@media.berkeley.edu']
['louis@media.berkeley.edu']
['zqian@umich.edu']
['<postmaster@collab.sakaiproject.org>']
['<200801042109.m04L92hb007923@nakamura.uits.iupui.edu>']
['<source@collab.sakaiproject

['ian@caret.cam.ac.uk']
['source@collab.sakaiproject.org']
['ian@caret.cam.ac.uk']
['ian@caret.cam.ac.uk']
['stephen.marquard@uct.ac.za']
['<postmaster@collab.sakaiproject.org>']
['<200712240805.lBO85jQD027143@nakamura.uits.iupui.edu>']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['apache@localhost)']
['source@collab.sakaiproject.org;']
['stephen.marquard@uct.ac.za']
['source@collab.sakaiproject.org']
['stephen.marquard@uct.ac.za']
['stephen.marquard@uct.ac.za']
['stephen.marquard@uct.ac.za']
['<postmaster@collab.sakaiproject.org>']
['<200712240705.lBO75Q96027085@nakamura.uits.iupui.edu>']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['apache@localhost)']
['source@collab.sakaiproject.org;']
['stephen.marquard@uct.ac.za']
['source@collab.sakaiproject.org']
['stephen.marquard@uct.ac.za']
['stephen.marquard@uct.ac.za']
['ian@caret.cam.ac.uk']
['<postma

In [37]:
# The example above has some problems:
# the extracted addresses include characters such as "<", ">", and ";"
# that are not part of the address.
# Big picture: we want to pull a specific value out of the document,
# in this case the email addresses. If we only wanted to find the lines
# where a word appears, we would use re.search().

# Use [a-zA-Z0-9]\S*@\S*[a-zA-Z] 

# translation:
# [a-zA-Z0-9] => the match must start with a single lowercase letter,
# uppercase letter, or digit

# \S*@\S* => zero or more non-blank characters, followed by an at-sign,
# followed by zero or more non-blank characters
# notice we switched from + to * to indicate zero or more non-blank 
# characters, since [a-zA-Z0-9] is already one non-blank character

# [a-zA-Z] => the match must end with a letter

import re 
hand = open('mbox-short.txt') 
for line in hand: 
    line = line.rstrip()
    x = re.findall(r'[a-zA-Z0-9]\S*@\S*[a-zA-Z]', line) 
    if len(x) > 0: 
        print(x) 

['stephen.marquard@uct.ac.za']
['postmaster@collab.sakaiproject.org']
['200801051412.m05ECIaH010327@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['apache@localhost']
['source@collab.sakaiproject.org']
['stephen.marquard@uct.ac.za']
['source@collab.sakaiproject.org']
['stephen.marquard@uct.ac.za']
['stephen.marquard@uct.ac.za']
['louis@media.berkeley.edu']
['postmaster@collab.sakaiproject.org']
['200801042308.m04N8v6O008125@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['apache@localhost']
['source@collab.sakaiproject.org']
['louis@media.berkeley.edu']
['source@collab.sakaiproject.org']
['louis@media.berkeley.edu']
['louis@media.berkeley.edu']
['zqian@umich.edu']
['postmaster@collab.sakaiproject.org']
['200801042109.m04L92hb007923@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject

['source@collab.sakaiproject.org']
['csev@umich.edu']
['csev@umich.edu']
['csev@umich.edu']
['postmaster@collab.sakaiproject.org']
['200712271455.lBREtn2N031488@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['apache@localhost']
['source@collab.sakaiproject.org']
['csev@umich.edu']
['source@collab.sakaiproject.org']
['csev@umich.edu']
['csev@umich.edu']
['csev@umich.edu']
['postmaster@collab.sakaiproject.org']
['200712271446.lBREkWvx031476@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['apache@localhost']
['source@collab.sakaiproject.org']
['csev@umich.edu']
['source@collab.sakaiproject.org']
['csev@umich.edu']
['csev@umich.edu']
['csev@umich.edu']
['postmaster@collab.sakaiproject.org']
['200712271423.lBRENg6s031462@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['source@
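
The effect of the tighter pattern is easiest to see side by side. The sketch below runs the loose and the tight pattern on three header fragments written in the style of the output above; the fragments are illustrative samples, not lines read from the file.

```python
import re

# Illustrative fragments in the style of the headers printed above.
samples = [
    "Return-Path: <postmaster@collab.sakaiproject.org>",
    "for <source@collab.sakaiproject.org>; Sat, 05 Jan 2008",
    "(envelope-from apache@localhost)",
]

loose = r"\S+@\S+"                      # any non-blank run around an @
tight = r"[a-zA-Z0-9]\S*@\S*[a-zA-Z]"   # must start with letter/digit, end with letter

loose_hits = [re.findall(loose, s) for s in samples]
tight_hits = [re.findall(tight, s) for s in samples]

for before, after in zip(loose_hits, tight_hits):
    print(before, "->", after)
```

The tight pattern is still only a rough filter for this particular file, not a general-purpose email validator.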

In [44]:
# Aside: just a stupid example 

# we could use re.findall() to pull entire lines 
# but it's awkward and not clean in what it's doing... 

import re
hand = open('mbox-short.txt') 
for line in hand: 
    line = line.rstrip()
    murder = re.findall('murder',line)
    if len(murder) > 0: # so if you've found it 
        print(line) 
        

## A much better way is to simply run a search on the line 
# instead of using the re.findall
# re.findall() is more to extract out the value we care about 

import re
hand = open('mbox-short.txt') 
for line in hand: 
    line = line.rstrip() 
    if re.search('murder', line): 
        print(line)        
        


Received: from murder (mail.umich.edu [141.211.14.90])
Received: from murder ([unix socket])
Received: from murder (mail.umich.edu [141.211.14.97])
Received: from murder ([unix socket])
Received: from murder (mail.umich.edu [141.211.14.25])
Received: from murder ([unix socket])
Received: from murder (mail.umich.edu [141.211.14.25])
Received: from murder ([unix socket])
Received: from murder (mail.umich.edu [141.211.14.46])
Received: from murder ([unix socket])
Received: from murder (mail.umich.edu [141.211.14.93])
Received: from murder ([unix socket])
Received: from murder (mail.umich.edu [141.211.14.46])
Received: from murder ([unix socket])
Received: from murder (mail.umich.edu [141.211.14.46])
Received: from murder ([unix socket])
Received: from murder (mail.umich.edu [141.211.14.25])
Received: from murder ([unix socket])
Received: from murder (mail.umich.edu [141.211.14.36])
Received: from murder ([unix socket])
Received: from murder (mail.umich.edu [141.211.14.97])
Received: from 
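
One way to remember the difference: re.search() answers "is it there, and where?", while re.findall() answers "what are all the values?". A short sketch on a single line copied from the output above (the word "mayhem" is just a made-up pattern that does not occur):

```python
import re

line = "Received: from murder (mail.umich.edu [141.211.14.90])"

# search() returns a match object (truthy) or None (falsy).
m = re.search(r"murder", line)
missing = re.search(r"mayhem", line)

# The match object knows what matched and where.
matched_text = m.group()
span_text = line[m.start():m.end()]

# findall() returns a list of every matching substring instead.
numbers = re.findall(r"[0-9]+", line)

print(bool(m), missing)
print(matched_text, span_text)
print(numbers)
```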

In [47]:
import re
hand = open('mbox-short.txt') 
for line in hand: 
    line = line.rstrip()
    if re.search(r'^X\S*: [0-9.]+', line): 
        print(line) 
        
# notice that \S* matches zero or more non-whitespace characters
# The [0-9.]+ matches one or more characters that are digits (0-9) or periods
# inside the [ ... ] the period matches an actual period;
# it is not a wildcard between the square brackets
# what it's trying to do is capture a number even if that number is a 
# decimal (aka has a period!) 

X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6178
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6961
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7565
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7626
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7556
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7002
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7615
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7601
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7605
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6959
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7606
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7559
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7605
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6932
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7558
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6526
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6948
X-DSPAM-Probability: 0.0000
X-DSPAM-Co
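
The point about the period can be checked directly. In this sketch the test strings are invented for the illustration; it compares a period outside brackets (wildcard), inside brackets (literal), and escaped with a backslash (also literal).

```python
import re

# Inside [ ], the period is a literal period, so this finds whole decimals.
decimals = re.findall(r"[0-9.]+", "v1.25 costs 3 dollars")

# Outside [ ], a bare period is a wildcard for any single character.
wildcard = re.findall(r"a.c", "abc a.c")

# Two equivalent ways to ask for a literal period outside a set.
in_brackets = re.findall(r"a[.]c", "abc a.c")
escaped = re.findall(r"a\.c", "abc a.c")

print(decimals)
print(wildcard)
print(in_brackets, escaped)
```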

In [62]:
# Now we want to extract the numbers... 
# First, do it with split() 
testSTR = "the cat was in the tree." 

# Now extract the last part... 
# split() breaks the string into words; index -1 is the last one
last_word = testSTR.split()[-1]
print(last_word[:-1]) # removes the period (the last character) 

tree


In [63]:
# Now try to extract it using a regular expression..
import re 

x = re.findall('tree', testSTR) 
print(x)

['tree']
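
Searching for the literal 'tree' only works when the answer is already known. A more general sketch, assuming words consist of letters only, describes the shape of the last word instead:

```python
import re

testSTR = "the cat was in the tree."

# Every run of letters is a word; the last item is the last word.
words = re.findall(r"[A-Za-z]+", testSTR)
last_by_index = words[-1]

# Or anchor at the end: letters followed by a period at the end of the string.
# The parentheses keep the period out of the result.
last_by_anchor = re.findall(r"([A-Za-z]+)\.$", testSTR)

print(last_by_index)
print(last_by_anchor)
```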


In [84]:
import re 
hand = open('mbox-short.txt') 
for line in hand: 
    line = line.rstrip() 
    x = re.findall(r'^X\S*: ([0-9.]+)', line)
    if len(x) > 0: 
        print(x) 
        
# so what's going on here ... 
# THE KEY difference is that we are using PARENTHESES
# The parentheses tell findall() to return only the floating-point number
# portion of the matching string!! 


# Compare to: re.search('^X\S*: [0-9.]+', line) 
# the small difference is the use of parentheses
# and of course the use of re.findall() instead of re.search() 

['0.8475']
['0.0000']
['0.6178']
['0.0000']
['0.6961']
['0.0000']
['0.7565']
['0.0000']
['0.7626']
['0.0000']
['0.7556']
['0.0000']
['0.7002']
['0.0000']
['0.7615']
['0.0000']
['0.7601']
['0.0000']
['0.7605']
['0.0000']
['0.6959']
['0.0000']
['0.7606']
['0.0000']
['0.7559']
['0.0000']
['0.7605']
['0.0000']
['0.6932']
['0.0000']
['0.7558']
['0.0000']
['0.6526']
['0.0000']
['0.6948']
['0.0000']
['0.6528']
['0.0000']
['0.7002']
['0.0000']
['0.7554']
['0.0000']
['0.6956']
['0.0000']
['0.6959']
['0.0000']
['0.7556']
['0.0000']
['0.9846']
['0.0000']
['0.8509']
['0.0000']
['0.9907']
['0.0000']
['0.7003']
['0.0000']
['0.8507']
['0.0000']
['0.9895']
['0.0000']
['0.9965']
['0.0000']
['0.9875']
['0.0000']
['0.9867']
['0.0000']
['0.9903']
['0.0000']
['0.7006']
['0.0000']
['0.9907']
['0.0000']
['0.9886']
['0.0000']
['0.8495']
['0.0000']
['0.7606']
['0.0000']
['0.9875']
['0.0000']
['0.8489']
['0.0000']
['0.9854']
['0.0000']
['0.7549']
['0.0000']
['0.9877']
['0.0000']
['0.9881']
['0.0000']
['0.9864']
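
The extracted values are still strings. A possible next step, sketched here on a few in-memory lines rather than the file, is to convert them with float() and summarize them. Narrowing the pattern to X-DSPAM-Confidence is a choice made for this sketch, not something the cell above does.

```python
import re

# A few lines in the style of the file; the Subject line is made up.
lines = [
    "X-DSPAM-Confidence: 0.8475",
    "X-DSPAM-Probability: 0.0000",
    "X-DSPAM-Confidence: 0.6178",
    "Subject: not a header we want 1.5",
]

values = []
for line in lines:
    # The parentheses make findall() return only the number.
    found = re.findall(r"^X-DSPAM-Confidence: ([0-9.]+)", line)
    if found:
        values.append(float(found[0]))  # convert the string to a number

average = sum(values) / len(values)
print(values)
print(average)
```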

In [72]:
# here we want to just extract the numbers... 
# Could you have used split? sure you could've ...

numeric_data = [] 

import re 
hand = open('mbox-short.txt') 
for line in hand: 
    line = line.rstrip() 
    if re.search(r'^X\S*: ([0-9.]+)', line):
        print(line.split()[1]) # this just prints it, doesn't store
        numeric_data.append(line.split()[1]) # this stores it
        


0.8475
0.0000
0.6178
0.0000
0.6961
0.0000
0.7565
0.0000
0.7626
0.0000
0.7556
0.0000
0.7002
0.0000
0.7615
0.0000
0.7601
0.0000
0.7605
0.0000
0.6959
0.0000
0.7606
0.0000
0.7559
0.0000
0.7605
0.0000
0.6932
0.0000
0.7558
0.0000
0.6526
0.0000
0.6948
0.0000
0.6528
0.0000
0.7002
0.0000
0.7554
0.0000
0.6956
0.0000
0.6959
0.0000
0.7556
0.0000
0.9846
0.0000
0.8509
0.0000
0.9907
0.0000
0.7003
0.0000
0.8507
0.0000
0.9895
0.0000
0.9965
0.0000
0.9875
0.0000
0.9867
0.0000
0.9903
0.0000
0.7006
0.0000
0.9907
0.0000
0.9886
0.0000
0.8495
0.0000
0.7606
0.0000
0.9875
0.0000
0.8489
0.0000
0.9854
0.0000
0.7549
0.0000
0.9877
0.0000
0.9881
0.0000
0.9864
0.0000
0.9870
0.0000
0.8493
0.0000
0.9837
0.0000
0.8479
0.0000
0.9852
0.0000
0.9928
0.0000
0.9898
0.0000
0.9855
0.0000
0.8484
0.0000
0.9856
0.0000
0.9892
0.0000
0.8484
0.0000
0.8518
0.0000
0.8483
0.0000
0.8524
0.0000
0.9879
0.0000
0.9843
0.0000
0.7568
0.0000
0.7568
0.0000
0.7619
0.0000
0.9871
0.0000
0.8525
0.0000
0.8477
0.0000
0.9947
0.0000
0.9832
0.0000
0.8504

In [82]:
for item in numeric_data: 
    print([item]) # wrap each value in a list so the output looks like findall()'s

['0.8475']
['0.0000']
['0.6178']
['0.0000']
['0.6961']
['0.0000']
['0.7565']
['0.0000']
['0.7626']
['0.0000']
['0.7556']
['0.0000']
['0.7002']
['0.0000']
['0.7615']
['0.0000']
['0.7601']
['0.0000']
['0.7605']
['0.0000']
['0.6959']
['0.0000']
['0.7606']
['0.0000']
['0.7559']
['0.0000']
['0.7605']
['0.0000']
['0.6932']
['0.0000']
['0.7558']
['0.0000']
['0.6526']
['0.0000']
['0.6948']
['0.0000']
['0.6528']
['0.0000']
['0.7002']
['0.0000']
['0.7554']
['0.0000']
['0.6956']
['0.0000']
['0.6959']
['0.0000']
['0.7556']
['0.0000']
['0.9846']
['0.0000']
['0.8509']
['0.0000']
['0.9907']
['0.0000']
['0.7003']
['0.0000']
['0.8507']
['0.0000']
['0.9895']
['0.0000']
['0.9965']
['0.0000']
['0.9875']
['0.0000']
['0.9867']
['0.0000']
['0.9903']
['0.0000']
['0.7006']
['0.0000']
['0.9907']
['0.0000']
['0.9886']
['0.0000']
['0.8495']
['0.0000']
['0.7606']
['0.0000']
['0.9875']
['0.0000']
['0.8489']
['0.0000']
['0.9854']
['0.0000']
['0.7549']
['0.0000']
['0.9877']
['0.0000']
['0.9881']
['0.0000']
['0.9864']

In [85]:
# returning to this idea: you can grab very specific values
# by using parentheses with re.findall()

import re 
hand = open('mbox-short.txt')
for line in hand: 
    line = line.rstrip() 
    x = re.findall('^Details:.*rev=([0-9.]+)', line) 
    if len(x) > 0: 
        print(x) 

['39772']
['39771']
['39770']
['39769']
['39766']
['39765']
['39764']
['39763']
['39762']
['39761']
['39760']
['39759']
['39758']
['39757']
['39756']
['39755']
['39754']
['39753']
['39752']
['39751']
['39750']
['39749']
['39746']
['39745']
['39744']
['39743']
['39742']
['39741']
['39740']
['39739']
['39738']
['39737']
['39736']
['39735']
['39734']
['39733']
['39732']
['39731']
['39730']
['39728']
['39729']
['39727']
['39726']
['39725']
['39724']
['39723']
['39722']
['39721']
['39720']
['39719']
['39718']
['39717']
['39716']
['39715']
['39714']
['39713']
['39712']
['39711']
['39710']
['39709']
['39708']
['39707']
['39706']
['39697']
['39696']
['39695']
['39694']
['39692']
['39691']
['39690']
['39689']
['39688']
['39687']
['39686']
['39685']
['39684']
['39683']
['39682']
['39681']
['39680']
['39679']
['39678']
['39677']
['39676']
['39675']
['39674']
['39673']
['39672']
['39671']
['39670']
['39669']
['39668']
['39667']
['39666']
['44484']
['39665']
['39664']
['39663']
['39662']
['39660']
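
To see which part of a line this pattern picks out, here is a sketch on a single sample line. The line is assumed to look like the Details lines in mbox-short.txt; treat the exact URL as illustrative.

```python
import re

# Sample line assumed to resemble the Details lines in the file.
line = "Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772"

# ^Details:  the line must start with 'Details:'
# .*         any characters (greedy)
# rev=       the literal text 'rev='
# ([0-9.]+)  capture the digits (and periods) that follow
revs = re.findall(r"^Details:.*rev=([0-9.]+)", line)
rev_number = int(revs[0])

# A line that does not start with 'Details:' yields an empty list.
no_match = re.findall(r"^Details:.*rev=([0-9.]+)", "Subject: rev=12")

print(revs, rev_number)
print(no_match)
```

Note that "view=rev" does not trigger a match on its own, because the pattern requires the equals sign directly after "rev".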


In [86]:
import re 
hand = open('mbox-short.txt')
for line in hand: 
    line = line.rstrip() 
    x = re.findall('^From .* ([0-9][0-9]):', line) 
    if len(x) > 0: print(x)

['09']
['18']
['16']
['15']
['15']
['14']
['11']
['11']
['11']
['11']
['11']
['11']
['10']
['10']
['10']
['09']
['07']
['06']
['04']
['04']
['04']
['19']
['17']
['17']
['16']
['16']
['16']
['15']
['15']
['15']
['15']
['15']
['14']
['13']
['13']
['13']
['13']
['13']
['13']
['13']
['13']
['13']
['12']
['12']
['12']
['12']
['11']
['11']
['11']
['10']
['10']
['10']
['10']
['10']
['10']
['10']
['10']
['09']
['09']
['09']
['09']
['09']
['08']
['18']
['18']
['17']
['17']
['17']
['16']
['16']
['15']
['15']
['15']
['14']
['12']
['12']
['11']
['11']
['09']
['09']
['09']
['09']
['09']
['09']
['08']
['08']
['08']
['08']
['07']
['05']
['04']
['04']
['03']
['03']
['03']
['03']
['02']
['20']
['07']
['16']
['16']
['14']
['14']
['14']
['10']
['09']
['22']
['16']
['16']
['13']
['12']
['10']
['10']
['21']
['14']
['10']
['10']
['23']
['16']
['14']
['12']
['12']
['11']
['21']
['21']
['17']
['17']
['17']
['17']
['17']
['13']
['12']
['10']
['10']
['10']
['09']
['09']
['09']
['09']
['08']
['21']
['16']
['11']
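
A list of hours like this is usually a first step toward counting them. A minimal sketch, using three sample lines in the style of the file's "From " lines and collections.Counter from the standard library (the choice of Counter is an assumption of this sketch; a plain dict would work too):

```python
import re
from collections import Counter

# Sample lines in the style of the file; the 'From:' line should be skipped
# because the pattern requires a space right after 'From'.
lines = [
    "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008",
    "From: stephen.marquard@uct.ac.za",
    "From louis@media.berkeley.edu Fri Jan  4 18:10:48 2008",
    "From zqian@umich.edu Fri Jan  4 16:10:39 2008",
]

hours = Counter()
for line in lines:
    # ' ([0-9][0-9]):' means: a space, two captured digits, then a colon.
    for hour in re.findall(r"^From .* ([0-9][0-9]):", line):
        hours[hour] += 1

for hour, count in sorted(hours.items()):
    print(hour, count)
```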

In [99]:
import re
x = "We just received $10.00 for cookies." 
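y = re.findall(r'\$[0-9]+\.[0-9][0-9]', x)  # \$ and \. match a literal $ and .
print(y) # + is greedy, so it grabs every digit before the decimal point

y_test = re.findall(r'\$[0-9][0-9]', x)
z_test = re.findall(r'\$[0-9]+', x)
print(y_test, z_test) 

# Both give the same result here because the amount has exactly two
# digits before the decimal point,
# but using the + makes it much more flexible (e.g. $5 or $1250).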

['$10.00']
['$10'] ['$10']
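
Both $ and . have special meanings in a regular expression, so escaping matters in patterns like this one. The sketch below uses made-up strings such as "$10x00" and "$1250.00" to show what goes wrong without the backslashes, and where [0-9][0-9] and [0-9]+ stop being equivalent.

```python
import re

x = "We just received $10.00 for cookies."

# Escaped $ and . match the literal characters.
exact = re.findall(r"\$[0-9]+\.[0-9][0-9]", x)

# An unescaped . is a wildcard, so it also accepts 'x' where the period should be.
sloppy = re.findall(r"\$[0-9]+.[0-9][0-9]", "$10x00")
strict = re.findall(r"\$[0-9]+\.[0-9][0-9]", "$10x00")

# An unescaped $ means "end of line", so this pattern can never find digits after it.
unescaped_dollar = re.findall(r"$[0-9]+", x)

# With a longer amount, the fixed two-digit pattern cuts the number short.
two_digits = re.findall(r"\$[0-9][0-9]", "$1250.00")
one_or_more = re.findall(r"\$[0-9]+", "$1250.00")

print(exact)
print(sloppy, strict)
print(unescaped_dollar)
print(two_digits, one_or_more)
```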
