In [2]:
import re
import pandas as pd
import numpy as np

In [2]:
# split a string with a variable number of whitespace characters
# (tabs, spaces, and newlines). The regex describing one or more whitespace characters
# is \s+:
text = "foo bar\t baz \tqux"
re.split(r'\s+', text)

['foo', 'bar', 'baz', 'qux']

In [3]:
regex = re.compile(r'\s+')
regex.split(text)

['foo', 'bar', 'baz', 'qux']

In [None]:
# To get a list of all substrings matching the regex instead, use the
# findall method:
regex.findall(text)

__Creating a regex object with re.compile is highly recommended if you intend to apply the same expression to many strings; doing so will save CPU cycles.__
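As a minimal sketch of that advice, the pattern below is compiled once and then reused for several strings. The `lines` list is made-up sample input, not data from this notebook.

```python
import re

# Compile the pattern once, outside the loop.
whitespace = re.compile(r'\s+')

# Hypothetical sample inputs, invented for this illustration.
lines = ["foo bar\t baz \tqux", "a  b", "one\ntwo three"]

# Reuse the same compiled object for every string.
tokens = [whitespace.split(line) for line in lines]
print(tokens)
```

Module-level functions such as re.split also cache recently compiled patterns internally, so the saving is often modest; the compiled object mainly pays off in tight loops and keeps the pattern defined in one place.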

match and search are closely related to findall. While findall returns all matches in a string, search returns only the first match. More rigidly, match only matches at the beginning of the string.
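A trivial sketch of the difference between the three methods, using a made-up sample string and a digits-only pattern (both are assumptions for illustration):

```python
import re

digits = re.compile(r'\d+')
sample = "abc 123 def 456"  # hypothetical sample string

all_matches = digits.findall(sample)  # every non-overlapping match
first = digits.search(sample)         # first match anywhere in the string
at_start = digits.match(sample)       # only succeeds at position 0

print(all_matches)    # ['123', '456']
print(first.group())  # 123
print(at_start)       # None, because the string starts with letters
```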

As a less trivial example, let’s consider a block of text and a regular expression capable of identifying most email addresses:

In [4]:
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'

# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)

In [5]:
# Using findall on the text produces a list of the email addresses:
regex.findall(text)

['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']

In [7]:
# search returns a special match object for the first email address in the text.
m = regex.search(text)
m

<re.Match object; span=(5, 20), match='dave@google.com'>

In [8]:
text[m.start():m.end()]

'dave@google.com'
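The slice above can also be written with the match object's own accessors. This sketch rebuilds a shortened copy of the text so that it runs on its own; span and group are standard methods of re.Match.

```python
import re

# Shortened copy of the notebook's text, so the sketch is self-contained.
text = "Dave dave@google.com\nSteve steve@gmail.com\n"
regex = re.compile(r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}',
                   flags=re.IGNORECASE)

m = regex.search(text)
print(m.span())   # (5, 20), the same as (m.start(), m.end())
print(m.group())  # dave@google.com, the same as text[m.start():m.end()]
```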

In [9]:
# regex.match returns None, as it will only match if the pattern occurs at the start of the string:
print(regex.match(text))

None


In [10]:
# Relatedly, sub will return a new string with occurrences of the pattern replaced by a new string:
print(regex.sub('REDACTED', text))

Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED



__Suppose you wanted to find email addresses and simultaneously segment each address into its three components: username, domain name, and domain suffix. To do this, put parentheses around the parts of the pattern to segment:__

In [11]:
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)

In [13]:
# A match object produced by this modified regex returns a tuple of the pattern
# components with its groups method:
m = regex.match('wesm@bright.net')
m.groups()

('wesm', 'bright', 'net')

In [14]:
# findall returns a list of tuples when the pattern has groups:
regex.findall(text)

[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]

sub also has access to groups in each match using special symbols like \1 and \2. The
symbol \1 corresponds to the first matched group, \2 corresponds to the second, and
so forth:

In [15]:
print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))

Dave Username: dave, Domain: google, Suffix: com
Steve Username: steve, Domain: gmail, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com
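Besides a replacement string with \1-style references, sub also accepts a function that receives each match object. The sketch below is only an illustration of that option; mask_username is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import re

pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)

def mask_username(m):
    # m is the match object for one address; groups() yields its three parts.
    username, domain, suffix = m.groups()
    # Keep the first character of the username and hide the rest.
    return f"{username[0]}***@{domain}.{suffix}"

masked = regex.sub(mask_username, "Dave dave@google.com")
print(masked)  # Dave d***@google.com
```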



### __Vectorized String Functions__

In [3]:
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
        'Rob': 'rob@gmail.com', 'Wes': np.nan}
data = pd.Series(data)
data

Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes                  NaN
dtype: object

In [4]:
data.isnull()

Dave     False
Steve    False
Rob      False
Wes       True
dtype: bool

You can apply string and regular expression methods (passing a lambda or other
function) to each value using data.map, but it will fail on the NA (null) values. To
cope with this, Series has array-oriented methods for string operations that skip NA
values. These are accessed through Series’s str attribute; for example, we could check
whether each email address has 'gmail' in it with str.contains:

In [5]:
data.str.contains('gmail')

Dave     False
Steve     True
Rob       True
Wes        NaN
dtype: object
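To make the failure mode concrete, the sketch below runs the data.map approach on the same Series and catches the resulting error, then shows str.contains with its na parameter, which fills the missing entry with a chosen value instead of NaN. The variable names map_failed and has_gmail are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})

# Plain Python string logic breaks on the missing value, which is a float NaN.
try:
    data.map(lambda x: 'gmail' in x)
    map_failed = False
except TypeError:
    map_failed = True

# The vectorized method skips the NA; na=False reports it as False instead.
has_gmail = data.str.contains('gmail', na=False)

print(map_failed)
print(has_gmail)
```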

In [6]:
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
pattern

'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'

In [7]:
data.str.findall(pattern, flags=re.IGNORECASE)

Dave     [(dave, google, com)]
Steve    [(steve, gmail, com)]
Rob        [(rob, gmail, com)]
Wes                        NaN
dtype: object
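If the goal is one column per captured group rather than a list of tuples, pandas also provides str.extract, which returns a DataFrame. This is a minimal sketch of that alternative on the same data; the name parts is an assumption for illustration.

```python
import re

import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

# One row per input value, one column per capture group (0, 1, 2).
# The row for the missing value is filled with NaN.
parts = data.str.extract(pattern, flags=re.IGNORECASE)
print(parts)
```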

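There are a couple of ways to do vectorized element retrieval: either use str.get or
index into the str attribute. Note that str.match, used in the next cell, does not
retrieve elements; it returns True or False for each value depending on whether the
pattern matches at the start of the string: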

In [9]:
matches = data.str.match(pattern, flags=re.IGNORECASE)
matches

Dave     True
Steve    True
Rob      True
Wes       NaN
dtype: object
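The two retrieval routes, str.get and indexing into str, could be sketched as follows. The sketch applies them to the findall result, since that holds lists of tuples; the names found, first, and domains are assumptions introduced for this illustration.

```python
import re

import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

# Each non-null value becomes a list holding one (username, domain, suffix) tuple.
found = data.str.findall(pattern, flags=re.IGNORECASE)

# Take the first tuple from each list; the missing value stays NaN.
first = found.str[0]

# Pull the domain (position 1) out of each tuple with str.get.
domains = first.str.get(1)
print(domains)
```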

You can similarly slice strings using this syntax:

In [10]:
data.str[:5]

Dave     dave@
Steve    steve
Rob      rob@g
Wes        NaN
dtype: object