# Matching and Extracting Data
- re.search() returns a Match object if the string matches the regular expression and None if it does not, so it can be used as a True/False test
- If we actually want the matching strings to be extracted, we use re.findall()
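
A minimal sketch of the difference between the two functions. The sample line and variable names here are illustrative assumptions, not part of the original notes:

```python
import re

line = 'From: stephen.marquard@uct.ac.za'

# re.search() gives back a Match object when the pattern is found, otherwise None
match = re.search('@', line)
found = match is not None
no_match = re.search('xyz', line)

# re.findall() gives back the matching substrings themselves
addresses = re.findall('[^ ]+@[^ ]+', line)

print(found)      # True
print(no_match)   # None
print(addresses)  # ['stephen.marquard@uct.ac.za']
```

Because a Match object is truthy and None is falsy, `if re.search(...)` works as a yes/no test.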

In [2]:
import re
x = 'My 2 favorite numbers are 19 and 42'
y = re.findall('[0-9]+', x)
print(y)

['2', '19', '42']


The expression above finds every run of one or more digits in x. The square brackets match a single character, and they can contain a range, a set of characters, etc. [0-9] matches a single digit, and the + means one or more of them.

re.findall() returns a list of zero or more substrings that match the regular expression.

In [3]:
import re
y = re.findall('[AEIOU]+', x)
print(y)

In the code above we are looking for runs of one or more uppercase vowels. Since x contains none, the result is an empty list.
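
The square brackets matter here. A short sketch, using a made-up sample string, of how the pattern behaves with and without them:

```python
import re

text = 'AEIOUUU and EA in IOWA'  # hypothetical sample string

# Without brackets: the literal letters A, E, I, O, then one or more U
literal = re.findall('AEIOU+', text)

# With brackets: one or more characters taken from the set A, E, I, O, U
char_class = re.findall('[AEIOU]+', text)

print(literal)     # ['AEIOUUU']
print(char_class)  # ['AEIOUUU', 'EA', 'IO', 'A']
```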

## Warning: Greedy Matching
The repeat characters (* and +) push outward in both directions (greedy) to match the largest possible string.

In [5]:
import re
x = 'From: using the : character'
y = re.findall('^F.+:', x)
print(y)

['From: using the :']


In the code above, '^F.+:':
- ^F = tells us the first character in the match is an F
- .+ = tells us there are one or more characters after the F
- : = tells us the last character in the match is a :

While 'From:' is a match, 'From: using the :' is also a match, and findall returns the longer one. This is known as greedy matching: the repeat characters match as large a string as possible.


## Non-Greedy Matching
Not all regular expression repeat codes are greedy! If you add a ? character after them, the + and * chill out a bit and match as little as possible.

In [7]:
import re
x = 'From: using the : character'
y = re.findall('^F.+?:', x)
print(y)

['From:']


In the code above, '^F.+?:':
- ^F = tells us the first character in the match is an F
- .+? = tells us there are one or more characters after the F, matched non-greedily
- : = tells us the last character in the match is a :

Non-greedy matching prefers the shortest possible match.
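
The two behaviours can be compared side by side. This is a small sketch; the second sample string (`html`) is a made-up example to show that the same rule applies to * and *?:

```python
import re

x = 'From: using the : character'

greedy = re.findall('^F.+:', x)   # .+ grabs as much as it can
lazy = re.findall('^F.+?:', x)    # .+? stops at the first ':' that works

print(greedy)  # ['From: using the :']
print(lazy)    # ['From:']

# Hypothetical second example: the same idea with * and *?
html = '<b>bold</b> and <i>italic</i>'
greedy_tags = re.findall('<.*>', html)
lazy_tags = re.findall('<.*?>', html)

print(greedy_tags)  # ['<b>bold</b> and <i>italic</i>']
print(lazy_tags)    # ['<b>', '</b>', '<i>', '</i>']
```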


## Fine-Tuning String Extraction
You can refine the match for re.findall() and separately determine which portion of the match is to be extracted by using parentheses.

In [8]:
import re
x = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'
y = re.findall(r'\S+@\S+', x)  # raw string keeps the backslashes intact for the regex
print(y)

['stephen.marquard@uct.ac.za']


  y = re.findall(r'\S+@\S+', x)


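In the code above:
- '\S+' = there is at least one non-whitespace character before and after the '@'

If the matching were non-greedy ('\S+?@\S+?'), the result would be 'stephen.marquard@u' instead of 'stephen.marquard@uct.ac.za': the match still starts at the leftmost possible position, but it stops after a single character following the '@'.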
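
A quick check of how the non-greedy variant of this pattern behaves, written as a small sketch with illustrative variable names:

```python
import re

x = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'

greedy = re.findall(r'\S+@\S+', x)
lazy = re.findall(r'\S+?@\S+?', x)

print(greedy)  # ['stephen.marquard@uct.ac.za']
print(lazy)    # ['stephen.marquard@u']
```

The regex engine tries start positions from left to right, so even the non-greedy version begins at the 's' of 'stephen'. Being non-greedy only shortens how far each repeat extends from there.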


### Parentheses
Parentheses are not part of the match, but they tell where the extracted string starts and stops.

In [9]:
import re
x = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'
y = re.findall(r'^From (\S+@\S+)', x)  # only the part inside ( ) is returned
print(y)

['stephen.marquard@uct.ac.za']


  y = re.findall(r'^From (\S+@\S+)', x)


Here, the 'From' is part of the match, but the only portion of the string that is extracted is the one that matches what's inside the parentheses.
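
As a sketch of how the number of parenthesized groups changes what re.findall() returns (the two-group pattern is an added illustration, not from the original notes):

```python
import re

x = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'

whole = re.findall(r'^From \S+@\S+', x)      # no group: the whole match
address = re.findall(r'^From (\S+@\S+)', x)  # one group: only the group
parts = re.findall(r'^From (\S+)@(\S+)', x)  # two groups: a list of tuples

print(whole)    # ['From stephen.marquard@uct.ac.za']
print(address)  # ['stephen.marquard@uct.ac.za']
print(parts)    # [('stephen.marquard', 'uct.ac.za')]
```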

In [10]:
data = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'

atpos = data.find('@')
print(atpos)

sppos = data.find(' ', atpos)
print(sppos)

host = data[atpos+1 : sppos]
print(host)

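21
31
uct.ac.za

The code above extracts the host name with plain string methods: find the position of the '@', find the first space after it, and slice out the text in between.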


## The Double Split Pattern
Sometimes we split a line one way, and then grab one of the pieces of the line and split that piece again.

In [None]:
line = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'

words = line.split()
email = words[1]
pieces = email.split('@')
print(pieces[1])

## The Regex Version

In [None]:
import re

lin = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'
y = re.findall('@([^ ]*)', lin)
print(y)

In the code above, the regular expression looks through the string until it finds an '@' sign, then extracts what follows:
- '[^ ]' = match a non-blank character
- '*' = match zero or more of them

## Even Cooler Regex Version

In [None]:
import re

lin = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'
y = re.findall('^From .*@([^ ]*)', lin)
print(y)

In the code above:
- '^' = starting at the beginning of the line
- 'From' = look for the string From
- Then a space
- '.*@' = any number of characters up to an at sign
- '([^ ]*)' = start extracting, match all the non-blank characters, then stop extracting
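
The three approaches covered so far (find and slice, double split, regex) can be checked against each other. This is a sketch only; the function names are hypothetical helpers introduced for the comparison:

```python
import re

line = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'

def host_by_find(text):
    # Locate the '@', then the first space after it, and slice between them
    atpos = text.find('@')
    sppos = text.find(' ', atpos)
    return text[atpos + 1:sppos]

def host_by_split(text):
    # Split on whitespace, take the address, then split it on '@'
    email = text.split()[1]
    return email.split('@')[1]

def host_by_regex(text):
    # Extract the non-blank characters that follow the '@'
    found = re.findall('^From .*@([^ ]*)', text)
    return found[0] if found else None

hosts = [host_by_find(line), host_by_split(line), host_by_regex(line)]
print(hosts)  # ['uct.ac.za', 'uct.ac.za', 'uct.ac.za']
```

The regex version is the only one of the three that also checks the line starts with 'From ' before extracting anything.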

## Spam Confidence

In [11]:
import re

hand = open('mbox-short-2.txt')
numlist = list()

for line in hand:
    line = line.rstrip()
    stuff = re.findall('^X-DSPAM-Confidence: ([0-9.]+)', line)
    
    if len(stuff) != 1:
        continue
    num = float(stuff[0])
    numlist.append(num)
print('Maximum:', max(numlist))


Maximum: 0.9907
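
The same loop can be tried without the mail file. In this sketch the list `sample_lines` is made-up data standing in for the file contents, so the values are assumptions chosen for illustration:

```python
import re

# Hypothetical sample lines standing in for the contents of the mail file
sample_lines = [
    'X-DSPAM-Result: Innocent',
    'X-DSPAM-Confidence: 0.8475',
    'X-DSPAM-Probability: 0.0000',
    'X-DSPAM-Confidence: 0.9907',
    'Subject: [sakai] svn commit',
]

numlist = []
for line in sample_lines:
    # Only lines that start with the header produce a one-item list
    stuff = re.findall('^X-DSPAM-Confidence: ([0-9.]+)', line.rstrip())
    if len(stuff) != 1:
        continue
    numlist.append(float(stuff[0]))

print('Maximum:', max(numlist))  # Maximum: 0.9907
```

Lines that do not start with the header give an empty list from re.findall(), so the length check skips them.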


## Escape Character
If you want a special regular expression character to just behave normally (most of the time), you prefix it with a backslash ('\').

In [12]:
import re
x = 'We just received $10.00 for cookies'
y = re.findall(r'\$[0-9.]+', x)  # raw string so \$ reaches the regex unchanged
print(y)

['$10.00']


  y = re.findall(r'\$[0-9.]+', x)


In the code above:
- '\$' = a real dollar sign
- '[0-9.]' = a digit or period
- '+' = one or more of them
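
A short sketch of what the backslash changes. The use of re.escape() is an addition to the notes: it is a standard library function that escapes special characters in a literal string for you.

```python
import re

x = 'We just received $10.00 for cookies'

escaped = re.findall(r'\$[0-9.]+', x)         # \$ matches a literal dollar sign
unescaped = re.findall('$[0-9.]+', x)         # bare $ means "end of the string"
literal = re.findall(re.escape('$10.00'), x)  # re.escape() adds the backslashes

print(escaped)    # ['$10.00']
print(unescaped)  # []
print(literal)    # ['$10.00']
```

Without the backslash, $ anchors the match at the end of the string, where no digits can follow, so nothing is found.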