## String Manipulation

Python has long been a popular data munging language in part due to its ease of use
for string and text processing. Most text operations are made simple with the string
object’s built-in methods. For more complex pattern matching and text manipulations,
regular expressions may be needed. pandas adds to the mix by enabling you to apply
string methods and regular expressions concisely on whole arrays of data, additionally
handling the annoyance of missing data.

In [1]:
import numpy as np
import pandas as pd
from pandas import DataFrame, Series


### String Object Methods
In many string munging and scripting applications, built-in string methods are sufficient.
As an example, a comma-separated string can be broken into pieces with split:

In [2]:
val = 'a,b, guido'

In [3]:
val

'a,b, guido'

In [4]:
val.split(',')

['a', 'b', ' guido']

In [5]:
pieces = [x.strip() for x in val.split(',')]

In [6]:
pieces

['a', 'b', 'guido']

These substrings could be concatenated together with a two-colon delimiter using addition:

In [7]:
first, second, third = pieces

In [8]:
first + '::' + second + '::' + third

'a::b::guido'

But this isn’t a practical generic method, since it hard-codes the number of pieces. A faster and more Pythonic way is to pass a list or tuple to the join method on the string '::':

In [9]:
'::'.join(pieces)

'a::b::guido'
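As a small sketch of why join generalizes better than addition: it accepts any number of pieces, but every item must already be a string. The sample list `numbers` is made up for illustration.

```python
pieces = ['a', 'b', 'guido']

# Works for any number of pieces, unlike first + '::' + second + ...
joined = '::'.join(pieces)

# join only accepts strings, so convert other values first
numbers = [1, 2, 3]  # hypothetical sample data
joined_numbers = '::'.join(str(n) for n in numbers)

print(joined)          # a::b::guido
print(joined_numbers)  # 1::2::3
```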

Other methods are concerned with locating substrings. Using Python’s in keyword is
the best way to detect a substring, though index and find can also be used:

In [10]:
'guido' in val

True

In [11]:
val.index(',')

1

In [12]:
val.find(':')

-1

Note that the difference between find and index is that index raises an exception if the
substring isn’t found (versus returning -1):

In [13]:
val.index(':')

ValueError: substring not found
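When the substring may legitimately be absent, the two styles look like this in practice. This is a minimal sketch; `locate` is a hypothetical helper name, not part of Python.

```python
val = 'a,b, guido'


def locate(s, sub):
    """Return the position of sub in s, or None if it is absent.

    Hypothetical helper, for illustration only.
    """
    try:
        return s.index(sub)  # raises ValueError when sub is missing
    except ValueError:
        return None


comma_pos = locate(val, ',')  # position of the first comma
colon_pos = locate(val, ':')  # None, because there is no colon
find_pos = val.find(':')      # -1, and no exception is raised
```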

Relatedly, count returns the number of occurrences of a particular substring:

In [14]:
val.count(',')

2

replace will substitute occurrences of one pattern for another. This is commonly used
to delete patterns, too, by passing an empty string:

In [15]:
val.replace(',', '::')

'a::b:: guido'

In [16]:
val.replace(',', '')

'ab guido'

Regular expressions can also be used with many of these operations as you’ll see below.

Table 7-3. Python built-in string methods

| Method | Description |
| --- | --- |
| count | Return the number of non-overlapping occurrences of substring in the string. |
| endswith, startswith | Return True if string ends with suffix (starts with prefix). |
| join | Use string as delimiter for concatenating a sequence of other strings. |
| index | Return position of first character in substring if found in the string. Raises ValueError if not found. |
| find | Return position of first character of first occurrence of substring in the string. Like index, but returns -1 if not found. |
| rfind | Return position of first character of last occurrence of substring in the string. Returns -1 if not found. |
| replace | Replace occurrences of string with another string. |
| strip, rstrip, lstrip | Trim whitespace, including newlines, from both sides, the right side, or the left side of the string, respectively. |
| split | Break string into list of substrings using passed delimiter. |
| lower, upper | Convert alphabet characters to lowercase or uppercase, respectively. |
| ljust, rjust | Left justify or right justify, respectively. Pad opposite side of string with spaces (or some other fill character) to return a string with a minimum width. |
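Several methods in the table are not exercised in the examples above. The following sketch applies a few of them to the same `val` string; the fill character and widths are arbitrary choices for illustration.

```python
val = 'a,b, guido'

starts = val.startswith('a')      # True
ends = val.endswith('guido')      # True
last_comma = val.rfind(',')       # position of the last comma
shouted = val.upper()             # 'A,B, GUIDO'
padded = 'guido'.rjust(8, '.')    # pad on the left to a width of 8
left = 'guido'.ljust(8)           # pad on the right with spaces
```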

### Regular Expressions

Regular expressions provide a flexible way to search or match string patterns in text. A
single expression, commonly called a regex, is a string formed according to the regular
expression language. Python’s built-in re module is responsible for applying regular
expressions to strings; I’ll give a number of examples of its use here.

NOTE: The art of writing regular expressions could be a chapter of its own and
thus is outside the book’s scope. There are many excellent tutorials and
references on the internet, such as Zed Shaw’s Learn Regex The Hard
Way (http://regex.learncodethehardway.org/book/).

The re module functions fall into three categories: pattern matching, substitution, and
splitting. Naturally these are all related; a regex describes a pattern to locate in the text,
which can then be used for many purposes. Let’s look at a simple example: suppose I
wanted to split a string with a variable number of whitespace characters (tabs, spaces,
and newlines). The regex describing one or more whitespace characters is \s+:

In [18]:
import re

In [19]:
text = "foo bar\t baz \tqux"

In [20]:
re.split(r'\s+', text)

['foo', 'bar', 'baz', 'qux']

When you call re.split('\s+', text), the regular expression is first compiled, then its
split method is called on the passed text. You can compile the regex yourself with
re.compile, forming a reusable regex object:

In [21]:
regex = re.compile(r'\s+')

In [22]:
regex.split(text)

['foo', 'bar', 'baz', 'qux']

If, instead, you wanted to get a list of all patterns matching the regex, you can use the
findall method:

In [23]:
regex.findall(text)

[' ', '\t ', ' \t']

NOTE: To avoid unwanted escaping with \ in a regular expression, use raw
string literals like r'C:\x' instead of the equivalent 'C:\\x'.
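A quick check makes the equivalence concrete: the raw literal and the escaped literal produce the same string, and the same holds for a regex such as the whitespace pattern used above.

```python
raw = r'C:\x'
escaped = 'C:\\x'

same = raw == escaped  # both spell the same four characters
n_chars = len(raw)     # the backslash counts as a single character

# The same idea applied to a regex pattern
ws_raw = r'\s+'
ws_escaped = '\\s+'
```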

Creating a regex object with re.compile is highly recommended if you intend to apply
the same expression to many strings; doing so will save CPU cycles.
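A minimal sketch of that reuse, assuming a small made-up list of strings (`lines`): the pattern is compiled once and the resulting object is applied to each string in turn.

```python
import re

# Compile once, then reuse the regex object for every string
whitespace = re.compile(r'\s+')

lines = ['foo  bar', 'baz\tqux', 'one \t two']  # hypothetical sample data
tokens = [whitespace.split(line) for line in lines]

print(tokens)
```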

match and search are closely related to findall. While findall returns all matches in a
string, search returns only the first match. More rigidly, match only matches at the
beginning of the string. As a less trivial example, let’s consider a block of text and a
regular expression capable of identifying most email addresses:


In [24]:
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)

Using findall on the text produces a list of the e-mail addresses:

In [26]:
regex.findall(text)

['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']

search returns a special match object for the first email address in the text. For the
above regex, the match object can only tell us the start and end position of the pattern
in the string:

In [27]:
m = regex.search(text)

In [28]:
m

<_sre.SRE_Match object; span=(5, 20), match='dave@google.com'>

In [29]:
text[m.start():m.end()]

'dave@google.com'

regex.match returns None, as it only will match if the pattern occurs at the start of the
string:

In [31]:
print(regex.match(text))

None
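To see the three behaviors side by side, here is a short sketch on a shortened copy of the text. It also uses the optional starting position that compiled regex objects accept in match, which the text above does not cover.

```python
import re

text = """Dave dave@google.com
Steve steve@gmail.com
"""
regex = re.compile(r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}',
                   flags=re.IGNORECASE)

at_start = regex.match(text)     # None: the text begins with 'Dave '
anywhere = regex.search(text)    # first address, wherever it occurs
from_pos = regex.match(text, 5)  # attempt the match from index 5 instead

print(at_start)
print(anywhere.group(), anywhere.span())
print(from_pos.group())
```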


Relatedly, sub will return a new string with occurrences of the pattern replaced by
a new string:

In [33]:
print(regex.sub('REDACTED', text))

Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED



Suppose you wanted to find email addresses and simultaneously segment each address
into its three components: username, domain name, and domain suffix. To do this, put
parentheses around the parts of the pattern to segment:

In [34]:
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

In [35]:
regex = re.compile(pattern, flags=re.IGNORECASE)

A match object produced by this modified regex returns a tuple of the pattern components
with its groups method:

In [36]:
m = regex.match('wesm@bright.net')

In [37]:
m.groups()

('wesm', 'bright', 'net')

findall returns a list of tuples when the pattern has groups:

In [38]:
regex.findall(text)

[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]

sub also has access to groups in each match using special symbols like \1, \2, etc.:

In [40]:
print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))

Dave Username: dave, Domain: google, Suffix: com
Steve Username: steve, Domain: gmail, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com



There is much more to regular expressions in Python, most of which is outside the
book’s scope. To give you a flavor, one variation on the above email regex gives names
to the match groups:

In [41]:
regex = re.compile(r"""
(?P<username>[A-Z0-9._%+-]+)
@
(?P<domain>[A-Z0-9.-]+)
\.
(?P<suffix>[A-Z]{2,4})""", flags=re.IGNORECASE|re.VERBOSE)

The match object produced by such a regex can produce a handy dict with the specified
group names:

In [43]:
m = regex.match('wesm@bright.net')


In [44]:
m.groupdict()

{'domain': 'bright', 'suffix': 'net', 'username': 'wesm'}
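Named groups combine naturally with finditer when there are several matches in a block of text. The sketch below builds one dict per address; the variable names `records` and `first_domain` are illustrative.

```python
import re

text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""

regex = re.compile(r"""
    (?P<username>[A-Z0-9._%+-]+)
    @
    (?P<domain>[A-Z0-9.-]+)
    \.
    (?P<suffix>[A-Z]{2,4})""", flags=re.IGNORECASE | re.VERBOSE)

# One dict of named groups per match found in the text
records = [m.groupdict() for m in regex.finditer(text)]

# A single named group can also be read directly from a match object
first_domain = regex.search(text).group('domain')

print(records[0])
print(first_domain)
```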

Table 7-4. Regular expression methods

| Method | Description |
| --- | --- |
| findall, finditer | Return all non-overlapping matching patterns in a string. findall returns a list of all patterns, while finditer returns match objects one by one from an iterator. |
| match | Match pattern at start of string and optionally segment pattern components into groups. If the pattern matches, returns a match object; otherwise None. |
| search | Scan string for a match to pattern, returning a match object if one is found. Unlike match, the match can be anywhere in the string as opposed to only at the beginning. |
| split | Break string into pieces at each occurrence of pattern. |
| sub, subn | Replace occurrences of pattern in string with a replacement expression. sub returns the new string; subn returns a tuple of the new string and the number of substitutions made. Both accept an optional count to limit the number of replacements. Use symbols \1, \2, ... to refer to match group elements in the replacement string. |
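The examples above do not show subn, the count argument, or finditer. A short sketch, using a made-up one-line string of addresses, might look like this:

```python
import re

text = 'dave@google.com, steve@gmail.com, rob@gmail.com'
regex = re.compile(r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}',
                   flags=re.IGNORECASE)

# subn returns the new string together with the number of replacements
redacted, n_subs = regex.subn('REDACTED', text)

# count limits how many occurrences sub replaces
first_only = regex.sub('REDACTED', text, count=1)

# finditer yields match objects, here reduced to their (start, end) spans
spans = [m.span() for m in regex.finditer(text)]

print(redacted, n_subs)
print(first_only)
print(spans)
```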

### Vectorized String Functions in pandas
Cleaning up a messy data set for analysis often requires a lot of string munging and
regularization. To complicate matters, a column containing strings will sometimes have
missing data:
    

In [45]:
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
        'Rob': 'rob@gmail.com', 'Wes': np.nan}

In [46]:
data = Series(data)

In [47]:
data

Dave     dave@google.com
Rob        rob@gmail.com
Steve    steve@gmail.com
Wes                  NaN
dtype: object

In [48]:
data.isnull()

Dave     False
Rob      False
Steve    False
Wes       True
dtype: bool

String and regular expression methods can be applied to each value by passing a lambda
or other function to data.map, but this will fail on the NA value. To cope with this,
Series has concise methods for string operations that skip NA values. These are accessed
through the Series’s str attribute; for example, we could check whether each email address
has 'gmail' in it with str.contains:

In [49]:
data.str.contains('gmail')

Dave     False
Rob       True
Steve     True
Wes        NaN
dtype: object
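The contrast between map and the str accessor can be sketched as follows. The flag name `map_failed` is illustrative, and the `na=False` argument to str.contains is a pandas option that the text above does not discuss; it fills missing entries with False instead of propagating NaN.

```python
import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})

try:
    data.map(lambda x: 'gmail' in x)
    map_failed = False
except TypeError:
    # 'gmail' in nan fails because a float is not iterable
    map_failed = True

has_gmail = data.str.contains('gmail')                   # NA is propagated
has_gmail_filled = data.str.contains('gmail', na=False)  # NA becomes False

print(map_failed)
print(has_gmail_filled)
```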

Regular expressions can be used, too, along with any re options like IGNORECASE:

In [58]:
pattern

'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'

In [59]:
data.str.findall(pattern, flags=re.IGNORECASE)

Dave     [(dave, google, com)]
Rob        [(rob, gmail, com)]
Steve    [(steve, gmail, com)]
Wes                        NaN
dtype: object

There are a couple of ways to do vectorized element retrieval. Either use str.get or
index into the str attribute:

In [60]:
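# In current pandas, str.match returns booleans rather than groups, so take
# the first findall result to get each element's tuple of groups
matches = data.str.findall(pattern, flags=re.IGNORECASE).str[0]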


In [61]:
matches

Dave     (dave, google, com)
Rob        (rob, gmail, com)
Steve    (steve, gmail, com)
Wes                      NaN
dtype: object

In [62]:
matches.str.get(1)

Dave     google
Rob       gmail
Steve     gmail
Wes         NaN
dtype: object

In [63]:
matches.str[0]

Dave      dave
Rob        rob
Steve    steve
Wes        NaN
dtype: object
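As an alternative to indexing into tuples, pandas also offers str.extract, which is not covered in the text above. The sketch below assumes a reasonably recent pandas version, where it returns the captured groups as columns of a DataFrame; the column names are chosen here for illustration.

```python
import re

import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

# One column per capture group; rows with missing data stay NaN
parts = data.str.extract(pattern, flags=re.IGNORECASE)
parts.columns = ['username', 'domain', 'suffix']  # illustrative names

print(parts)
```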

You can similarly slice strings using this syntax:

In [64]:
data.str[:5]

Dave     dave@
Rob      rob@g
Steve    steve
Wes        NaN
dtype: object

Table 7-5. Vectorized string methods

| Method | Description |
| --- | --- |
| cat | Concatenate strings element-wise with optional delimiter. |
| contains | Return a boolean array indicating whether each string contains pattern/regex. |
| count | Count occurrences of pattern. |
| endswith, startswith | Equivalent to x.endswith(pattern) or x.startswith(pattern) for each element. |
| findall | Compute list of all occurrences of pattern/regex for each string. |
| get | Index into each element (retrieve i-th element). |
| join | Join strings in each element of the Series with passed separator. |
| len | Compute length of each string. |
| lower, upper | Convert cases; equivalent to x.lower() or x.upper() for each element. |
| match | Use re.match with the passed regular expression on each element. Current pandas versions return a boolean indicating whether each element matches; older versions returned the matched groups. |
| pad | Add whitespace to left, right, or both sides of strings. |
| center | Equivalent to pad(side='both'). |
| repeat | Duplicate values; for example, s.str.repeat(3) is equivalent to x * 3 for each string. |
| replace | Replace occurrences of pattern/regex with some other string. |
| slice | Slice each string in the Series. |
| split | Split strings on delimiter or regular expression. |
| strip, rstrip, lstrip | Trim whitespace, including newlines; equivalent to x.strip() (and rstrip, lstrip, respectively) for each element. |
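A few of the methods in the table, applied to the same email Series, are sketched below. The `regex=False` argument to str.replace is an assumption about the installed pandas version (it exists in recent releases) and requests a literal rather than regex replacement.

```python
import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})

lengths = data.str.len()                 # missing values stay NaN
shouting = data.str.upper()
usernames = data.str.split('@').str[0]   # text before the '@'
masked = data.str.replace('@', ' at ', regex=False)

print(lengths)
print(usernames)
print(masked)
```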