# Vectorized String Operations

One strength of Python is its relative ease in handling and manipulating string data. Pandas builds on this and provides a comprehensive set of vectorized string operations that become an essential piece of the type of munging required when working with (read: cleaning up) real-world data. In this section, we'll walk through some of the Pandas string operations, and then take a look at using them to partially clean up a very messy dataset of recipes collected from the Internet.

## Introducing Pandas String Operations

We saw in previous sections how tools like NumPy and Pandas generalize arithmetic operations so that we can easily and quickly perform the same operation on many array elements. For example:

In [1]:
import numpy as np
x = np.array([2, 3, 5, 7, 11, 13])

x * 2

array([ 4,  6, 10, 14, 22, 26])

Vectorizing operations simplifies the syntax of operating on arrays of data: we no longer have to worry about the size or shape of the array, only about what operation we want done. For arrays of strings, NumPy does not provide such simple access, so you're stuck using a more verbose loop syntax:


In [3]:
data = ['peter', 'MARY', 'jORGE']
[s.capitalize() for s in data]

['Peter', 'Mary', 'Jorge']

For clean data this works perfectly; however, if any values are missing, it will break:


In [4]:
data = ['peter', None, 'joRGE', 'MARY']
[s.capitalize() for s in data]

AttributeError: 'NoneType' object has no attribute 'capitalize'
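Without Pandas, one way to guard against this error is an explicit None check inside the comprehension. The following is only a minimal sketch of that workaround; it works, but the check has to be repeated for every operation:

```python
# Plain-Python sketch: capitalize each string and pass missing values through unchanged.
data = ['peter', None, 'joRGE', 'MARY']

capitalized = [s.capitalize() if s is not None else None for s in data]
print(capitalized)
```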

Pandas includes features to address both this need for vectorized string operations and the need to handle missing data correctly, via the **str** attribute of Pandas Series and Index objects containing strings. So, for example, suppose we create a Pandas Series with this data:

In [5]:
import pandas as pd
names = pd.Series(data)
names

0    peter
1     None
2    joRGE
3     MARY
dtype: object

One option is to fill in the missing entry explicitly with fillna():

In [13]:
names.fillna('none')  # replace the missing entry with the string 'none'

0    peter
1     none
2    joRGE
3     MARY
dtype: object

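Alternatively, we can call a single method that capitalizes all the entries while skipping over any missing values:

In [12]: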
names.str.capitalize()

0    Peter
1     None
2    Jorge
3     Mary
dtype: object
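The same skipping behavior should apply to the other **str** methods as well. As a small sketch (the variable names `upper` and `lengths` are just illustrative), the missing entry stays missing instead of raising an error:

```python
import pandas as pd

names = pd.Series(['peter', None, 'joRGE', 'MARY'])

upper = names.str.upper()   # the missing entry is passed through as missing
lengths = names.str.len()   # the missing entry has no length, so it stays missing

print(upper)
print(lengths)
```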

Using tab completion on this **str** attribute will list all the vectorized string methods available to Pandas.
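Outside an interactive session, a similar list can be obtained with Python's built-in dir(). This is a rough sketch; the exact set of names depends on the installed Pandas version:

```python
import pandas as pd

names = pd.Series(['peter', 'MARY'])

# Public attribute names on the .str accessor; underscore-prefixed names are skipped.
str_methods = [name for name in dir(names.str) if not name.startswith('_')]
print(len(str_methods))
print(str_methods[:10])
```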

## Tables of Pandas String Methods

If you have a good understanding of string manipulation in Python, most of the Pandas string syntax is intuitive enough that it's probably sufficient to just list a table of available methods; we will start with that here, before diving deeper into a few of the subtleties. The examples in this section use the following Series of names:

In [14]:
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

## Methods similar to Python string methods

Nearly all of Python's built-in string methods are mirrored by a Pandas vectorized string method. Here is a list of Pandas **str** methods that mirror Python string methods:

In [29]:
# A plain nested list is enough to display the method names as a grid
methods = [['len()', 'lower()', 'translate()', 'islower()'],
           ['ljust()', 'upper()', 'startswith()', 'isupper()'],
           ['center()', 'rfind()', 'isalnum()', 'isdecimal()'],
           ['zfill()', 'index()', 'isalpha()', 'split()'],
           ['strip()', 'rindex()', 'isdigit()', 'rsplit()'],
           ['rstrip()', 'capitalize()', 'isspace()', 'partition()'],
           ['lstrip()', 'swapcase()', 'istitle()', 'rpartition()']]
methods

[['len()', 'lower()', 'translate()', 'islower()'],
 ['ljust()', 'upper()', 'startswith()', 'isupper()'],
 ['center()', 'rfind()', 'isalnum()', 'isdecimal()'],
 ['zfill()', 'index()', 'isalpha()', 'split()'],
 ['strip()', 'rindex()', 'isdigit()', 'rsplit()'],
 ['rstrip()', 'capitalize()', 'isspace()', 'partition()'],
 ['lstrip()', 'swapcase()', 'istitle()', 'rpartition()']]

Note that these have various return values. Some, like lower(), return a Series of strings:

In [34]:
monte.str.lower()

0    graham chapman
1       john cleese
2     terry gilliam
3         eric idle
4       terry jones
5     michael palin
dtype: object

But some others return numbers:

In [35]:
monte.str.len()

0    14
1    11
2    13
3     9
4    11
5    13
dtype: int64

Or Boolean values:

In [36]:
monte.str.startswith('T')

0    False
1    False
2     True
3    False
4     True
5    False
dtype: bool
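A Boolean result like this can be used directly as a mask to select rows. A brief sketch, where the names `starts_with_t` and `terrys` are illustrative:

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# Boolean Series: True where the name begins with 'T'
starts_with_t = monte.str.startswith('T')

# Indexing with the mask keeps only the matching entries (and their original labels)
terrys = monte[starts_with_t]
print(terrys)
```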

Still others return lists or other compound values for each element:

In [37]:
monte.str.split()

0    [Graham, Chapman]
1       [John, Cleese]
2     [Terry, Gilliam]
3         [Eric, Idle]
4       [Terry, Jones]
5     [Michael, Palin]
dtype: object

We'll see further manipulations of this kind of series-of-lists object as we continue our discussion.
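When every entry splits into the same number of pieces, one option is to pass expand=True so that split() returns a DataFrame instead of a Series of lists. This is a sketch only; the column names `first` and `last` are assumptions chosen for illustration:

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# expand=True spreads the split pieces across columns instead of returning lists
parts = monte.str.split(expand=True)
parts.columns = ['first', 'last']  # illustrative labels, not produced by Pandas
print(parts)
```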

## Methods using regular expressions

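In addition, there are several methods that accept regular expressions to examine the content of each string element, and follow some of the API conventions of Python's built-in re module. For example, we can extract the first name from each entry by asking for the first contiguous group of letters in each element: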

In [64]:
monte.str.extract('([A-Za-z]+)', expand=False)

0     Graham
1       John
2      Terry
3       Eric
4      Terry
5    Michael
dtype: object
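With more than one capture group, extract() returns one column per group, and named groups supply the column labels. A small sketch, assuming every entry has the form "first last"; the group names `first` and `last` are illustrative:

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# Two named groups: letters, whitespace, then letters again
pattern = r'(?P<first>[A-Za-z]+)\s+(?P<last>[A-Za-z]+)'

# The result is a DataFrame with one column per named group
name_parts = monte.str.extract(pattern)
print(name_parts)
```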

Or we can do something more complicated, like finding all names that start and end with a consonant, making use of the start-of-string (^) and end-of-string ($) regular expression characters:

In [65]:
monte.str.findall(r'^[^AEIOU].*[^aeiou]$')

0    [Graham Chapman]
1                  []
2     [Terry Gilliam]
3                  []
4       [Terry Jones]
5     [Michael Palin]
dtype: object


The ability to concisely apply regular expressions across Series or DataFrame entries opens up many possibilities for analysis and cleaning of data.
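As one illustration of such cleaning-style use, the same kind of pattern can be turned into a Boolean mask with str.contains() and used to filter the Series. This is a sketch under the assumption that a consonant simply means "not one of a, e, i, o, u":

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# True where the whole name starts with a non-vowel and ends with a non-vowel
mask = monte.str.contains(r'^[^AEIOU].*[^aeiou]$')

# Keep only the matching names
consonant_names = monte[mask]
print(consonant_names)
```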

### Vectorized item access and slicing

The *get()* and *slice()* operations, in particular, enable vectorized element access from each array. For example, we can get a slice of the first three characters of each array using str.slice(0, 3). Note that this behavior is also available through Python's normal indexing syntax–for example, df.str.slice(0, 3) is equivalent to df.str[0:3]:

In [66]:
monte.str[0:3]

0    Gra
1    Joh
2    Ter
3    Eri
4    Ter
5    Mic
dtype: object

Indexing via df.str.get(i) and df.str[i] is likewise similar.

These get() and slice() methods also let you access elements of arrays returned by split(). For example, to extract the last name of each entry, we can combine split() and get():

In [69]:
monte.str.split().str.get(-1)

0    Chapman
1     Cleese
2    Gilliam
3       Idle
4      Jones
5      Palin
dtype: object
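The equivalence between get() and the indexing syntax can be checked directly, and the two forms chain naturally. In this sketch the `initials` Series is only an illustrative use of chained indexing:

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

words = monte.str.split()

# get(-1) and [-1] both pick the last element of each list
last_via_get = words.str.get(-1)
last_via_index = words.str[-1]
print(last_via_get.equals(last_via_index))

# Chained indexing: first letter of the first word plus first letter of the last word
initials = words.str[0].str[0] + words.str[-1].str[0]
print(initials)
```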

### Indicator variables

Another method that requires a bit of extra explanation is the get_dummies() method. This is useful when your data has a column containing some sort of coded indicator. For example, we might have a dataset that contains information in the form of codes, such as A="born in America", B="born in the United Kingdom", C="likes Cheese", D="likes Spam":

In [71]:
full_monte = pd.DataFrame({'name': monte,
                           'info': ['B|C|D', 'B|D', 'A|C',
                                    'B|D', 'B|C', 'B|C|D']})

In [73]:
full_monte

    info            name
0  B|C|D  Graham Chapman
1    B|D     John Cleese
2    A|C   Terry Gilliam
3    B|D       Eric Idle
4    B|C     Terry Jones
5  B|C|D   Michael Palin


The get_dummies() routine lets you quickly split out these indicator variables into a DataFrame:

In [75]:
full_monte['info'].str.get_dummies('|')

   A  B  C  D
0  0  1  1  1
1  0  1  0  1
2  1  0  1  0
3  0  1  0  1
4  0  1  1  0
5  0  1  1  1


With these operations as building blocks, you can construct an endless range of string processing procedures when cleaning your data.
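As one example of combining these building blocks, the indicator columns can be joined back onto the original table and used for filtering. This is a sketch only; the names `dummies`, `combined`, and `likes_both` are illustrative:

```python
import pandas as pd

full_monte = pd.DataFrame({
    'name': ['Graham Chapman', 'John Cleese', 'Terry Gilliam',
             'Eric Idle', 'Terry Jones', 'Michael Palin'],
    'info': ['B|C|D', 'B|D', 'A|C', 'B|D', 'B|C', 'B|C|D'],
})

# One 0/1 column per code found in the '|'-separated strings
dummies = full_monte['info'].str.get_dummies('|')

# Attach the indicator columns to the original rows (the indexes line up)
combined = full_monte.join(dummies)

# C = "likes Cheese", D = "likes Spam": select the people flagged with both
likes_both = combined.loc[(combined['C'] == 1) & (combined['D'] == 1), 'name']
print(likes_both)
```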

## Example: Recipe Database

These vectorized string operations become most useful in the process of cleaning messy, real-world data. Here we'll walk through an example, using an open recipe database compiled from various sources on the web. Our goal will be to parse the recipe data into ingredient lists, so we can quickly find a recipe based on some ingredient we have on hand.



In [89]:
!curl -O http://openrecipes.s3.amazonaws.com/recipeitems-latest.json.gz


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100    20  100    20    0     0     63      0 --:--:-- --:--:-- --:--:--    63


In [98]:
!gunzip recipeitems-latest.json.gz

'gunzip' is not recognized as an internal or external command,
operable program or batch file.
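On systems where the gunzip command is unavailable, as in the error shown here, the archive can instead be decompressed with Python's standard gzip module. The following is a sketch, not a drop-in replacement: the helper name `gunzip_file` is hypothetical, it leaves the original archive in place, and it assumes the download is a valid gzip file. The transfer log above reports only 20 bytes received, so that assumption may not hold for this particular download.

```python
import gzip
import os
import shutil


def gunzip_file(src_path, dst_path=None):
    """Decompress a .gz file using only the standard library; return the output path."""
    if dst_path is None:
        # Drop a trailing ".gz" to mirror the naming the gunzip command would use
        dst_path = src_path[:-3] if src_path.endswith('.gz') else src_path + '.out'
    with gzip.open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        shutil.copyfileobj(src, dst)
    return dst_path


archive = 'recipeitems-latest.json.gz'  # file name taken from the curl command above
if os.path.exists(archive):
    try:
        print(gunzip_file(archive))
    except (OSError, EOFError) as err:
        # Raised when the file is not a gzip archive or is truncated
        print('Could not decompress archive:', err)
```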


