# Vectorized String Operations

One strength of Python is its relative ease in handling and manipulating string data. Pandas builds on this and provides a comprehensive set of vectorized string operations that become an essential piece of the type of munging required when working with (read: cleaning up) real-world data. In this section, we'll walk through some of the Pandas string operations, and then take a look at using them to partially clean up a very messy dataset of recipes collected from the Internet.

## Introducing Pandas String Operations

We saw in previous sections how tools like NumPy and Pandas generalize arithmetic operations so that we can easily and quickly perform the same operation on many array elements. For example:

In [1]:
import numpy as np
x = np.array([2, 3, 5, 7, 11, 13])

x * 2

array([ 4,  6, 10, 14, 22, 26])

Vectorization simplifies the syntax of operating on arrays of data: we no longer have to worry about the size or shape of the array, only about what operation we want done. For arrays of strings, NumPy does not provide such simple access, so you're stuck using a more verbose loop syntax:


In [3]:
data = ['peter', 'MARY', 'jORGE']
[s.capitalize() for s in data]

['Peter', 'Mary', 'Jorge']

For clean data this works perfectly; however, if any values are missing, it will break:


In [4]:
data = ['peter', None, 'joRGE', 'MARY']
[s.capitalize() for s in data]

AttributeError: 'NoneType' object has no attribute 'capitalize'
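
Staying in plain Python, the failure above can be avoided by hand with an explicit guard on each element. The following is only a minimal sketch of that workaround (the name `capitalized` is introduced here for illustration); it shows the kind of boilerplate that every string operation would otherwise need:

```python
data = ['peter', None, 'joRGE', 'MARY']

# Only call capitalize() on actual strings; pass anything else
# (here, the missing value None) through unchanged.
capitalized = [s.capitalize() if isinstance(s, str) else s for s in data]
print(capitalized)
```

The guard works, but it has to be repeated for every operation.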

Pandas includes features to address both the need for vectorized string operations and the correct handling of missing data, via the **str** attribute of Pandas Series and Index objects containing strings. So, for example, suppose we create a Pandas Series with this data:

In [5]:
import pandas as pd
names = pd.Series(data)
names

0    peter
1     None
2    joRGE
3     MARY
dtype: object

In [13]:
# Replace the missing value with the placeholder string 'none' (returns a new Series)
names.fillna('none')

0    peter
1     none
2    joRGE
3     MARY
dtype: object

Calling a single method through the **str** attribute capitalizes every entry while skipping over the missing value:

In [12]:
names.str.capitalize()

0    Peter
1     None
2    Jorge
3     Mary
dtype: object
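
As a small, hedged check of the two claims above (missing values are skipped, and the accessor also exists on an Index), the sketch below rebuilds the same data and inspects the result. The variable names `result` and `idx` are illustrative only:

```python
import pandas as pd

names = pd.Series(['peter', None, 'joRGE', 'MARY'])
result = names.str.capitalize()

# The missing entry is still flagged as missing after the operation.
print(result.isna().tolist())

# The same str accessor is available on an Index of strings.
idx = pd.Index(['peter', 'joRGE', 'MARY'])
print(list(idx.str.capitalize()))
```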

Using tab completion on this **str** attribute will list all the vectorized string methods available to Pandas.
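
Outside an interactive session, roughly the same listing can be obtained with Python's built-in `dir()`. This is a sketch rather than an official way to enumerate the API, and the exact set of names depends on the installed Pandas version:

```python
import pandas as pd

names = pd.Series(['peter', None, 'joRGE', 'MARY'])

# Collect the public attribute names exposed by the str accessor.
public_methods = [m for m in dir(names.str) if not m.startswith('_')]
print(len(public_methods))
print(public_methods[:10])
```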

## Tables of Pandas String Methods

If you have a good understanding of string manipulation in Python, most of the Pandas string syntax is intuitive enough that it's probably sufficient to just list a table of available methods; we will start with that here, before diving deeper into a few of the subtleties. The examples in this section use the following Series of names:

In [14]:
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

## Methods similar to Python string methods

Nearly all of Python's built-in string methods are mirrored by a Pandas vectorized string method. Here is a list of Pandas **str** methods that mirror Python string methods:

In [29]:
methods = [['len()', 'lower()', 'translate()', 'islower()'],
           ['ljust()', 'upper()', 'startswith()', 'isupper()'],
           ['center()', 'rfind()','isalnum()', 'isdecimal()'],
           ['zfill()', 'index()', 'isalpha()','split()'],
           ['strip()', 'rindex()', 'isdigit()','rsplit()'],
           ['rstrip()', 'capitalize()', 'isspace()','partition()'],
           ['lstrip()', 'swapcase()', 'istitle()','rpartition()']]
methods

[['len()', 'lower()', 'translate()', 'islower()'],
 ['ljust()', 'upper()', 'startswith()', 'isupper()'],
 ['center()', 'rfind()', 'isalnum()', 'isdecimal()'],
 ['zfill()', 'index()', 'isalpha()', 'split()'],
 ['strip()', 'rindex()', 'isdigit()', 'rsplit()'],
 ['rstrip()', 'capitalize()', 'isspace()', 'partition()'],
 ['lstrip()', 'swapcase()', 'istitle()', 'rpartition()']]
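
To confirm that the names in this table really are available on the **str** accessor, one option is a quick `hasattr` check. The sketch below restates the method names without parentheses (the list `names_to_check` and the result `missing` are hypothetical helpers, not part of Pandas):

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

names_to_check = ['len', 'lower', 'translate', 'islower',
                  'ljust', 'upper', 'startswith', 'isupper',
                  'center', 'rfind', 'isalnum', 'isdecimal',
                  'zfill', 'index', 'isalpha', 'split',
                  'strip', 'rindex', 'isdigit', 'rsplit',
                  'rstrip', 'capitalize', 'isspace', 'partition',
                  'lstrip', 'swapcase', 'istitle', 'rpartition']

# Any name that the accessor does not expose ends up in this list.
missing = [name for name in names_to_check if not hasattr(monte.str, name)]
print(missing)
```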

In [34]:
monte.str.lower()

0    graham chapman
1       john cleese
2     terry gilliam
3         eric idle
4       terry jones
5     michael palin
dtype: object

Methods such as lower() return a Series of strings, as above, but some others return numbers:

In [35]:
monte.str.len()

0    14
1    11
2    13
3     9
4    11
5    13
dtype: int64

Others return Boolean values:

In [36]:
monte.str.startswith('T')

0    False
1    False
2     True
3    False
4     True
5    False
dtype: bool
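
Because methods like `startswith()` and `len()` return Boolean and integer Series aligned with the original index, their results can be used directly to filter the data. A minimal sketch, with `terrys` and `long_names` as illustrative variable names:

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# Use the Boolean result as a mask to keep only names starting with 'T'.
terrys = monte[monte.str.startswith('T')]
print(terrys.tolist())

# Compare the integer lengths to keep names longer than 12 characters.
long_names = monte[monte.str.len() > 12]
print(long_names.tolist())
```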