# Vectorized String Operations
One strength of Python is its relative ease in handling and manipulating string data.
Pandas builds on this and provides a comprehensive set of vectorized string operations
that become an essential piece of the type of munging required when one is working
with (read: cleaning up) real-world data. In this section, we’ll walk through some of
the Pandas string operations, and then take a look at using them to partially clean up
a very messy dataset of recipes collected from the Internet.

## Introducing Pandas String Operations

Tools like NumPy and Pandas generalize arithmetic
operations so that we can easily and quickly perform the same operation on many
array elements. For example:

In [1]:
import numpy as np 
x = np.array([1, 2, 3, 4])
x * 100

array([100, 200, 300, 400])

This <i>vectorization</i> of operations simplifies the syntax of operating on arrays of data:
we no longer have to worry about the size or shape of the array, but just about what
operation we want done. 

For arrays of strings, NumPy does not provide such simple
access, and thus you’re stuck using a more verbose loop syntax:

In [2]:
data = ['rodgers', 'omondi', 'nyangweso']
[i.capitalize() for i in data]

['Rodgers', 'Omondi', 'Nyangweso']

This is perhaps sufficient to work with some data, but it will break if there are any
missing values. For example:

In [3]:
data = ['rodgers', 'omondi', None, 'nyangweso']
[i.capitalize() for i in data]

AttributeError: 'NoneType' object has no attribute 'capitalize'
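Without Pandas, the loop needs an explicit guard for the missing entry. The following is a minimal sketch of that workaround; the variable name `capitalized` is chosen here for illustration only:

```python
data = ['rodgers', 'omondi', None, 'nyangweso']

# Guard each element explicitly so that None passes through untouched
capitalized = [s.capitalize() if s is not None else None for s in data]
print(capitalized)
```

This works, but every string operation would need the same guard repeated by hand.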

Pandas addresses both needs, vectorized string operations and correct handling of
missing data, through the str attribute of Pandas Series and Index objects
containing strings. So, for example, suppose we create a Pandas Series
with this data:

In [4]:
import pandas as pd
names = pd.Series(data)
names

0      rodgers
1       omondi
2         None
3    nyangweso
dtype: object

We can now call a single method that will capitalize all the entries, while skipping
over any missing values:

In [5]:
names.str.capitalize()

0      Rodgers
1       Omondi
2         None
3    Nyangweso
dtype: object
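The same accessor is available on Index objects, which can be handy for tidying column labels. The sketch below assumes a made-up set of labels (`columns` and `cleaned` are illustrative names, not part of the original example):

```python
import pandas as pd

# Hypothetical column labels with inconsistent case and stray whitespace
columns = pd.Index([' Name ', 'AGE', 'home Town'])

# Chain vectorized string methods through the .str accessor of the Index
cleaned = columns.str.strip().str.lower().str.replace(' ', '_', regex=False)
print(list(cleaned))
```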

## Tables of Pandas String Methods
If you have a good understanding of string manipulation in Python, most of Pandas’
string syntax is intuitive enough that it’s probably sufficient to just list a table of available
methods; we will start with that here, before diving deeper into a few of the subtleties.
The examples in this section use the following series of names:

In [6]:
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])
monte

0    Graham Chapman
1       John Cleese
2     Terry Gilliam
3         Eric Idle
4       Terry Jones
5     Michael Palin
dtype: object

### Methods similar to Python string methods
Nearly all of Python’s built-in string methods are mirrored by a Pandas vectorized string
method. Here is a list of Pandas str methods that mirror Python string methods:

<i>len(); ljust(); rjust(); center(); zfill(); strip(); rstrip(); lstrip(); lower(); upper(); find(); rfind(); index(); rindex(); capitalize(); swapcase(); translate(); startswith(); endswith(); isalnum(); isalpha(); isdigit(); isspace(); istitle(); islower(); isupper(); isnumeric(); isdecimal(); split(); rsplit(); partition(); rpartition()</i>

Notice that these have various return values. Some, like lower(), return a series of
strings:

In [7]:
monte.str.lower()

0    graham chapman
1       john cleese
2     terry gilliam
3         eric idle
4       terry jones
5     michael palin
dtype: object

But some others return numbers:

In [8]:
monte.str.len()

0    14
1    11
2    13
3     9
4    11
5    13
dtype: int64

In [9]:
# OR Boolean values:
monte.str.startswith('T')

0    False
1    False
2     True
3    False
4     True
5    False
dtype: bool

In [10]:
# Still others return lists or other compound values for each element:
monte.str.split()

0    [Graham, Chapman]
1       [John, Cleese]
2     [Terry, Gilliam]
3         [Eric, Idle]
4       [Terry, Jones]
5     [Michael, Palin]
dtype: object
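When a Series contains missing values, these methods propagate them rather than failing. The following sketch, using a small made-up Series, shows one way this might look, including the `na` argument that `startswith()` accepts for choosing what a missing entry should become:

```python
import pandas as pd

names = pd.Series(['Terry', None, 'Eric'])

# The missing entry stays missing in the numeric result
lengths = names.str.len()

# na=False treats the missing entry as "does not start with T"
starts_t = names.str.startswith('T', na=False)

print(lengths)
print(starts_t)
```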

### Methods using regular expressions
In addition, there are several methods that accept regular expressions to examine the
content of each string element, and follow some of the API conventions of Python’s
built-in re module:

| Method | Description |
|---|---|
| match() | Call re.match() on each element, returning a Boolean. |
| extract() | Call re.search() on each element, returning the groups of the first match as strings. |
| findall() | Call re.findall() on each element. |
| replace() | Replace occurrences of pattern with some other string. |
| contains() | Call re.search() on each element, returning a Boolean. |
| count() | Count occurrences of pattern. |
| split() | Equivalent to str.split(), but accepts regexps. |
| rsplit() | Equivalent to str.rsplit(); splits on a literal delimiter, starting from the end of the string. |

With these, you can do a wide range of interesting operations. For example, we can
extract the first name from each by asking for a contiguous group of characters at the
beginning of each element:

In [11]:
# extract the first name from each 
monte.str.extract('([A-Za-z]+)')

         0
0   Graham
1     John
2    Terry
3     Eric
4    Terry
5  Michael


Or we can do something more complicated, like finding all names that start and end
with a consonant, making use of the start-of-string (^) and end-of-string ($) regular
expression characters:

In [12]:
# find all names that start and end with a consonant
monte.str.findall(r'^[^AEIOU].*[^aeiou]$')

0    [Graham Chapman]
1                  []
2     [Terry Gilliam]
3                  []
4       [Terry Jones]
5     [Michael Palin]
dtype: object

The ability to concisely apply regular expressions across Series or DataFrame entries
opens up many possibilities for analysis and cleaning of data.
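A few of the other regular-expression methods can be sketched on the same names. This is only an illustration; the result variable names are hypothetical, and the patterns are deliberately simple:

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# match() anchors at the start of each string; contains() searches anywhere
starts_with_terry = monte.str.match(r'Terry')
mentions_ll = monte.str.contains(r'll')

# count() tallies non-overlapping matches of the pattern in each string
n_lower_e = monte.str.count(r'e')

# replace() with regex=True turns "First Last" into "Last, First"
swapped = monte.str.replace(r'^(\w+) (\w+)$', r'\2, \1', regex=True)

print(swapped)
```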

### Miscellaneous methods
Finally, there are some miscellaneous methods that enable other convenient operations:

| Method | Description |
|---|---|
| get() | Index each element |
| slice() | Slice each element |
| slice_replace() | Replace slice in each element with passed value |
| cat() | Concatenate strings |
| repeat() | Repeat values |
| normalize() | Return Unicode form of string |
| pad() | Add whitespace to left, right, or both sides of strings |
| wrap() | Split long strings into lines with length less than a given width |
| join() | Join strings in each element of the Series with passed separator |
| get_dummies() | Extract dummy variables as a DataFrame |
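To give a feel for a few of the methods listed above, here is a short sketch on a made-up Series (the Series `s` and the result names are assumptions for illustration):

```python
import pandas as pd

s = pd.Series(['ab', 'cd', 'ef'])

# cat() with only a separator joins the whole Series into one string
joined = s.str.cat(sep='-')

# repeat() duplicates each string the given number of times
repeated = s.str.repeat(2)

# pad() fills each string on the chosen side up to the given width
padded = s.str.pad(4, side='left', fillchar='*')

# slice_replace() swaps the slice [0:1] of each string for 'X'
patched = s.str.slice_replace(0, 1, 'X')

print(joined)
```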

<mark><b>Vectorized item access and slicing</b></mark>. The <i>get()</i> and <i>slice()</i> operations, in particular,
enable vectorized element access from each string. For example, we can get a slice of
the first three characters of each string using str.slice(0, 3). Note that this behavior
is also available through Python’s normal indexing syntax; for example,
df.str.slice(0, 3) is equivalent to df.str[0:3]:

In [13]:
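# slice: first three characters, then everything from position 2 onward
print(monte.str[0:3])
print(monte.str.slice(2))
# get: the character at position 2, then the last character
print(monte.str.get(2))
print(monte.str.get(-1))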

0    Gra
1    Joh
2    Ter
3    Eri
4    Ter
5    Mic
dtype: object
0    aham Chapman
1       hn Cleese
2     rry Gilliam
3         ic Idle
4       rry Jones
5     chael Palin
dtype: object
0    a
1    h
2    r
3    i
4    r
5    c
dtype: object
0    n
1    e
2    m
3    e
4    s
5    n
dtype: object


Indexing via df.str.get(i) and df.str[i] is similar.
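The sketch below checks that equivalence on two of the names and shows what happens when the requested position lies past the end of a string. The variable names are illustrative:

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'Eric Idle'])

# str.get(i) and str[i] give the same result
same = monte.str.get(2).equals(monte.str[2])

# A position past the end of a string yields a missing value, not an error
past_end = monte.str.get(12)

print(same)
print(past_end)
```

Here 'Eric Idle' has only nine characters, so position 12 comes back as a missing value while 'Graham Chapman' returns a character.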

In [14]:
# split
monte.str.split()

0    [Graham, Chapman]
1       [John, Cleese]
2     [Terry, Gilliam]
3         [Eric, Idle]
4       [Terry, Jones]
5     [Michael, Palin]
dtype: object

These <i>get()</i> and <i>slice()</i> methods also let you access elements of arrays returned by
<i>split()</i>. For example, to extract the last name of each entry, we can combine
split() and get():

In [15]:
monte.str.split().str.get(-1)

0    Chapman
1     Cleese
2    Gilliam
3       Idle
4      Jones
5      Palin
dtype: object
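When both pieces are wanted at once, one option is the `expand=True` argument of `split()`, which spreads the pieces across DataFrame columns. A minimal sketch, with column names chosen here purely for illustration:

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam'])

# expand=True returns a DataFrame with one column per split piece
parts = monte.str.split(expand=True)
parts.columns = ['first', 'last']   # hypothetical column names

print(parts)
```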

<mark><b>Indicator variables</b></mark>. Another method that requires a bit of extra explanation is the
<i>get_dummies()</i> method. This is useful when your data has a column containing some
sort of coded indicator. For example, we might have a dataset that contains information
in the form of codes, such as A=“born in America,” B=“born in the United Kingdom,”
C=“likes cheese,” D=“likes spam”:

In [16]:
full_monte = pd.DataFrame({'name': monte,
                           'info': ['B|C|D', 'B|D', 'A|C', 'B|D', 'B|C', 'B|C|D']})
full_monte

             name   info
0  Graham Chapman  B|C|D
1     John Cleese    B|D
2   Terry Gilliam    A|C
3       Eric Idle    B|D
4     Terry Jones    B|C
5   Michael Palin  B|C|D


The <i>get_dummies()</i> routine lets you quickly split out these indicator variables into a
DataFrame:

In [17]:
full_monte['info'].str.get_dummies('|')

   A  B  C  D
0  0  1  1  1
1  0  1  0  1
2  1  0  1  0
3  0  1  0  1
4  0  1  1  0
5  0  1  1  1


With these operations as building blocks, you can construct an endless range of string
processing procedures when cleaning your data.
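As one possible building block, the indicator columns can be joined back onto the names and used as a filter. The sketch below uses a shortened version of the table; `flags`, `combined`, and `likes_cheese` are names introduced here for illustration:

```python
import pandas as pd

full_monte = pd.DataFrame({
    'name': ['Graham Chapman', 'John Cleese', 'Terry Gilliam'],
    'info': ['B|C|D', 'B|D', 'A|C'],
})

# Split the codes into indicator columns and attach them to the names
flags = full_monte['info'].str.get_dummies('|')
combined = full_monte[['name']].join(flags)

# Names of everyone flagged with code C ("likes cheese")
likes_cheese = combined.loc[combined['C'] == 1, 'name'].tolist()
print(likes_cheese)
```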

## Example: Recipe Database
These vectorized string operations become most useful in the process of cleaning up
messy, real-world data. Here I’ll walk through an example of that, using an open
recipe database compiled from various sources on the Web. Our goal will be to parse
the recipe data into ingredient lists, so we can quickly find a recipe based on some
ingredients we have on hand.

The scripts used to compile this can be found at https://github.com/fictivekin/openrecipes,
and the link to the current version of the database is found there as well.
As of spring 2016, this database was about 30 MB.

The database is in JSON format, so we will try pd.read_json to read it:

In [18]:
try:
    recipes = pd.read_json('openrecipes.json')
except ValueError as e:
    print("ValueError:", e)

ValueError: Trailing data


Oops! We get a ValueError mentioning that there is “trailing data.” Searching for this
error on the Internet, it seems that it’s due to using a file in which each line is itself
valid JSON, but the full file is not. Let’s check if this interpretation is true:

In [19]:
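from io import StringIO

# Read the file line by line and build a single JSON array string
with open('openrecipes.json', 'r') as f:
    # Strip whitespace from each line (each line is one JSON object)
    data = (line.strip() for line in f)
    # Join the objects with commas and wrap them in brackets to form a list
    data_json = "[{0}]".format(','.join(data))
# Parse the result; recent pandas versions expect a file-like object here
recipes = pd.read_json(StringIO(data_json))
recipes.shape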

(1042, 9)
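As an aside, files with one JSON object per line can usually also be read directly with the `lines=True` argument of `pd.read_json`. The sketch below demonstrates this on a tiny in-memory sample rather than the recipe file; the two records are made up for illustration:

```python
import io
import pandas as pd

# Two made-up records, one JSON object per line (hypothetical sample data)
text = '{"name": "Toast", "servings": 2}\n{"name": "Tea", "servings": 1}\n'

# lines=True tells read_json to parse each line as a separate record
sample = pd.read_json(io.StringIO(text), lines=True)
print(sample.shape)
```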

We see that recipes holds 1,042 recipes in 9 columns. Let’s take a look at one row
to see what we have:

In [20]:
recipes.iloc[0]

name                                      Easter Leftover Sandwich
ingredients      12 whole Hard Boiled Eggs\n1/2 cup Mayonnaise\...
url              http://thepioneerwoman.com/cooking/2013/04/eas...
image            http://static.thepioneerwoman.com/cooking/file...
cookTime                                                        PT
recipeYield                                                      8
datePublished                                           2013-04-01
prepTime                                                     PT15M
description      Got leftover Easter eggs?    Got leftover East...
Name: 0, dtype: object

There is a lot of information there, but much of it is in a very messy form, as is typical
of data scraped from the Web. In particular, the ingredient list is in string format;
we’re going to have to carefully extract the information we’re interested in. Let’s start
by taking a closer look at the ingredients:

In [21]:
recipes.ingredients.str.len().describe()

count    1042.000000
mean      358.645873
std       187.332133
min        22.000000
25%       246.250000
50%       338.000000
75%       440.000000
max      3160.000000
Name: ingredients, dtype: float64

The ingredient lists average about 359 characters long, with a minimum of 22 and a maximum
of 3,160 characters!

Just out of curiosity, let’s see which recipe has the longest ingredient list:

In [22]:
recipes.name[np.argmax(recipes.ingredients.str.len())]

'A Nice Berry Pie'
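Note that np.argmax returns a position, which matches the row label only because the DataFrame has a default integer index. A label-based alternative is `idxmax()`, sketched here on a small stand-in table (the data and the index labels are hypothetical):

```python
import pandas as pd

# Tiny stand-in for the recipes table, with a non-default index
toy = pd.DataFrame({
    'name': ['Toast', 'Berry Pie', 'Tea'],
    'ingredients': ['bread\nbutter', 'flour\nbutter\nberries\nsugar', 'tea leaves'],
}, index=['a', 'b', 'c'])

# idxmax() returns the index label of the longest string, so it pairs with .loc
longest_label = toy['ingredients'].str.len().idxmax()
longest_name = toy.loc[longest_label, 'name']
print(longest_name)
```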

We can do other aggregate explorations; for example, let’s see how many of the recipes
are for breakfast food:

In [23]:
recipes.description.str.contains('[Bb]reakfast').sum()

11

Or how many of the recipes list cinnamon as an ingredient:

In [24]:
recipes.ingredients.str.contains('[Cc]innamon').sum()

79

We could even look to see whether any recipes misspell the ingredient as “cinamon”:

In [25]:
recipes.ingredients.str.contains('[Cc]inamon').sum()

0
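The character-class trick for capitalization can also be replaced by a case-insensitive search. The following sketch uses a made-up Series with one missing entry; `case`, `flags`, and `na` are existing arguments of `str.contains`, while the data and variable names are assumptions:

```python
import re
import pandas as pd

ingredients = pd.Series(['1 tsp Cinnamon', 'ground CINNAMON', '2 eggs', None])

# case=False ignores letter case; na=False counts missing entries as non-matches
has_cinnamon = ingredients.str.contains('cinnamon', case=False, na=False)

# An optional second "n" catches the misspelling "cinamon" in the same pass
either_spelling = ingredients.str.contains(r'cinn?amon', flags=re.IGNORECASE, na=False)

print(has_cinnamon.sum())
```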

This is the type of essential data exploration that is possible with Pandas string tools.
It is data munging like this that Python really excels at.