# Tutorial - String functions in Pandas

### Pandas string functions

This tutorial shows how some useful **Pandas string functions** work, with a few toy examples. The syntax of these functions is typically `s.str.fname(args)`, where `s` is a Pandas series and `args` is a set of *ad hoc* arguments. They are **vectorized functions** which return a series of the same length, in which each term results from applying a string function to the corresponding term of `s`. 

In the first example, I include an **empty string** and a **missing value** for diversity. The `None` entry, which stands for a missing value, could be replaced by `numpy.nan` here.

In [1]:
import pandas as pd

In [2]:
presidents = pd.Series(['Donald Trump', 'Bill Clinton', '', None])
presidents

0    Donald Trump
1    Bill Clinton
2                
3            None
dtype: object

The function `str.len` returns the length of every element of a string series. The empty string (`''`) has length zero. Since `None` has no length, a `NaN` value is returned as the fourth term. Pandas then converts the series to data type `float` to cope with that (it would be `int` if all the terms had a length).

In [3]:
presidents.str.len()

0    12.0
1    12.0
2     0.0
3     NaN
dtype: float64
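As a quick check of the remark about data types, the sketch below applies `str.len` to a series without missing values. The series name `complete` is made up for this illustration.

```python
import pandas as pd

# Hypothetical series with no missing values
complete = pd.Series(['Donald Trump', 'Bill Clinton', ''])
lengths = complete.str.len()

# With no NaN to accommodate, the lengths are stored as integers
print(lengths)
print(lengths.dtype)
```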

In Python, **substrings** can be extracted from a string by slicing, just as we extract elements from a list. In Pandas, the same slicing syntax is applied to every term of a series through `.str[...]`. This can be useful to manage dates, as shown here.

In [4]:
dates = pd.Series(['2016-10-06', '2015-08-19', '2016-01-30'])
dates.str[:4]

0    2016
1    2015
2    2016
dtype: object
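The same slicing idea can pull out the other parts of the date. This is a minimal sketch, assuming all dates follow the fixed `YYYY-MM-DD` layout used above; the variable names `years`, `months` and `days` are illustrative.

```python
import pandas as pd

dates = pd.Series(['2016-10-06', '2015-08-19', '2016-01-30'])

# Positions 0-3 hold the year, 5-6 the month, 8-9 the day
years = dates.str[:4].astype(int)   # convert the substrings to integers
months = dates.str[5:7]             # still strings, e.g. '10'
days = dates.str[8:]

print(years)
print(months)
print(days)
```

Slicing by position only works when every term has the same layout. For real date handling, converting with `pd.to_datetime` is usually the safer route.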

In Python, strings are **joined** just like lists, with the plus sign (`+`). The same operator works term by term on string series:

In [5]:
firstnames = pd.Series(['Marvin', 'Leonard'])
secondnames = pd.Series(['Gaye', 'Cohen'])
firstnames + ' ' + secondnames

0      Marvin Gaye
1    Leonard Cohen
dtype: object
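An alternative to the plus sign is the method `str.cat`, which takes the separator as an argument. The sketch below should give the same result as the expression above.

```python
import pandas as pd

firstnames = pd.Series(['Marvin', 'Leonard'])
secondnames = pd.Series(['Gaye', 'Cohen'])

# Join the two series term by term, with a white space in between
fullnames = firstnames.str.cat(secondnames, sep=' ')
print(fullnames)
```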

Many methods of string data analysis are based on counting the occurrences of selected terms. Counting is typically preceded by **conversion to lowercase**, which can be done with the function `str.lower`:

In [6]:
students = pd.Series(['Pablo', 'Liudmila', 'Nana Yaa'])
students.str.lower()

0       pablo
1    liudmila
2    nana yaa
dtype: object
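To see why the conversion matters for counting, here is a small sketch with a made-up series (`notes`). It uses `str.count`, which returns the number of times a pattern occurs in each term.

```python
import pandas as pd

# Hypothetical toy data, mixing upper and lower case
notes = pd.Series(['Data data DATA', 'No match here'])

# Counting is case sensitive, so only the lowercase occurrence is found
raw_counts = notes.str.count('data')

# After conversion to lowercase, all three occurrences are found
lower_counts = notes.str.lower().str.count('data')

print(raw_counts.tolist())
print(lower_counts.tolist())
```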

The function `str.contains` **detects the presence of a pattern** in the terms of a string series. It returns a Boolean series indicating, term by term, whether the pattern occurs.

In [7]:
students.str.contains(pat='an')

0    False
1    False
2     True
dtype: bool

This can be used to filter out the terms that contain the pattern. For instance:

In [8]:
students[~students.str.contains(pat='an')]

0       Pablo
1    Liudmila
dtype: object

The symbol `~` stands for the logical operator NOT, applied to the Boolean series between the brackets. It turns `True` into `False` and vice versa.
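When the series contains missing values, or when case should not matter, the arguments `na` and `case` of `str.contains` help. A minimal sketch, reusing the `presidents` series from the first example:

```python
import pandas as pd

presidents = pd.Series(['Donald Trump', 'Bill Clinton', '', None])

# case=False ignores upper/lower case; na=False treats missing values as non-matches
mask = presidents.str.contains(pat='bill', case=False, na=False)
print(mask)

# A Boolean series without missing values can be used directly as a row filter
print(presidents[mask])
```

Without `na=False`, the missing value would propagate to the result, and the result could not be used as a filter without further cleaning.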

The function `str.findall` **extracts matching patterns** from the terms of a string series. It returns, for every term, a list containing all the occurrences of that pattern. Since the pattern can occur a different number of times in each term, the lists can have different lengths. The series returned then has data type `object`, which Pandas uses for generic Python objects such as strings and lists.

In [9]:
students.str.findall(pat='a')

0             [a]
1             [a]
2    [a, a, a, a]
dtype: object
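Since `str.len` also works on lists, it can be chained after `str.findall` to count the occurrences in each term. A short sketch:

```python
import pandas as pd

students = pd.Series(['Pablo', 'Liudmila', 'Nana Yaa'])

# Length of each list of matches = number of occurrences of 'a' in each term
n_matches = students.str.findall(pat='a').str.len()
print(n_matches)
```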

With the function `str.replace`, you can **replace matched patterns** in a string series. Example:

In [10]:
students.str.replace(pat=' ', repl='-')

0       Pablo
1    Liudmila
2    Nana-Yaa
dtype: object

Although the replacement (the argument `repl`) has to be a single string, the pattern (the argument `pat`) can cover several alternatives. In the preceding example, we replaced a single white space by a dash. Now, to replace either a white space or the letter 'o', we use the regular expression `'o| '` as the pattern. The symbol `|` stands for the logical operator OR, both in Pandas row filters and in the patterns used in Pandas string functions. Since Pandas 2.0, `str.replace` treats the pattern as a literal string by default, so we set `regex=True` to have it interpreted as a regular expression.

In [11]:
students.str.replace(pat='o| ', repl='-', regex=True)

0       Pabl-
1    Liudmila
2    Nana-Yaa
dtype: object
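The pattern `'o| '` only acts as an OR when it is read as a regular expression. The sketch below compares the two settings of the argument `regex`; the variable names are illustrative.

```python
import pandas as pd

students = pd.Series(['Pablo', 'Liudmila', 'Nana Yaa'])

# regex=False: the three-character string 'o| ' is searched literally (no match here)
literal = students.str.replace(pat='o| ', repl='-', regex=False)

# regex=True: '|' means OR, so either 'o' or ' ' is replaced
as_regex = students.str.replace(pat='o| ', repl='-', regex=True)

print(literal.tolist())
print(as_regex.tolist())
```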

The function `str.split` **splits up** the terms of a string series. If no **split pattern** is given, the terms are split on whitespace. The outcome of the split is a list for every term.

In [12]:
sayings = pd.Series(['Correlation is not causation', 'Flattery is the food of fools'])
sayings.str.split(pat=' ')

0       [Correlation, is, not, causation]
1    [Flattery, is, the, food, of, fools]
dtype: object
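The difference between leaving `pat` unspecified and setting `pat=' '` shows up when the words are separated by more than one white space. A minimal sketch with a made-up series (`spaced`):

```python
import pandas as pd

# Hypothetical input with irregular spacing
spaced = pd.Series(['Correlation  is not   causation'])

# No pattern: any run of whitespace counts as one separator
by_default = spaced.str.split()

# pat=' ': every single white space is a separator, so empty strings appear
by_space = spaced.str.split(pat=' ')

print(by_default[0])
print(by_space[0])
```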

### Regular expressions

Quite often, the transformations performed by the methods described above can be simplified by means of **regular expressions**. More specifically, they can be used as the argument `pat` in the functions `str.contains`, `str.findall`, `str.replace` and `str.split`. Note that, since Pandas 2.0, `str.replace` reads `pat` as a regular expression only when `regex=True` is set.

A regular expression is a pattern which describes a collection of strings. Among them, **character classes** are the simplest case. They are built by enclosing a collection of characters in square brackets. The square brackets indicate any of the characters enclosed. For instance, `[0-9]` stands for any digit, and `[A-Z]` for any capital letter. Two simple examples follow.

In [13]:
bio = pd.Series(['I was born in 1954', 'My phone is +34 932 534 200'])
bio.str.replace(pat='[a-z]', repl='x', regex=True)

0             I xxx xxxx xx 1954
1    Mx xxxxx xx +34 932 534 200
dtype: object

In [14]:
bio.str.replace(pat='[0-9]', repl='x', regex=True)

0             I was born in xxxx
1    My phone is +xx xxx xxx xxx
dtype: object

Character classes get more powerful when complemented with **quantifiers**. For instance, followed by a plus sign (`+`), a character class indicates a sequence of one or more characters of that class. So, `[0-9]+` indicates any sequence of digits, therefore any number, and `[a-zA-Z]+` indicates any word.

In [15]:
bio.str.replace(pat='[a-zA-Z]+', repl='x', regex=True)

0             x x x x 1954
1    x x x +34 932 534 200
dtype: object

We can also specify the minimum and maximum length of the sequence, between curly braces:

In [16]:
bio.str.replace(pat='[0-9]{1,3}', repl='x', regex=True)

0        I was born in xx
1    My phone is +x x x x
dtype: object
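A single number between the braces asks for an exact length. As a sketch, the pattern `[0-9]{4}` can be used to pick four-digit sequences, such as years:

```python
import pandas as pd

bio = pd.Series(['I was born in 1954', 'My phone is +34 932 534 200'])

# Exactly four consecutive digits; the phone number has no such run
four_digits = bio.str.findall(pat='[0-9]{4}')
print(four_digits)
```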

In the following example, I add a plus sign to a character class which stands for any alphanumeric character. This is a simple and clean way to transform a string into a **bag of words**:

In [17]:
bio.str.findall(pat='[a-zA-Z0-9]+')

0              [I, was, born, in, 1954]
1    [My, phone, is, 34, 932, 534, 200]
dtype: object
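Once the strings have been turned into bags of words, the words can be counted. The following is a minimal sketch, not the only way to do it: it flattens the lists with `explode` and counts the terms with `value_counts`, after converting to lowercase.

```python
import pandas as pd

sayings = pd.Series(['Correlation is not causation', 'Flattery is the food of fools'])

# One list of lowercase words per saying
bags = sayings.str.lower().str.findall(pat='[a-z0-9]+')

# explode() puts every word in its own row; value_counts() tallies them
word_counts = bags.explode().value_counts()
print(word_counts)
```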

With **wildcards**, you can match pieces of text whose exact content you do not know, other than the fact that they share a common pattern or structure (e.g. phone numbers or zip codes). The wildcard of Python regular expressions is the **dot** (`.`), which matches any single character (letter, digit, whitespace, etc.) except a line break. A last example follows.

In [18]:
people = pd.Series(['John - male', 'Nancy - female', 'Bruno - male'])
people.str.replace(pat=' - .+', repl='', regex=True)

0     John
1    Nancy
2    Bruno
dtype: object
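To see what the quantifier adds to the wildcard, the sketch below repeats the replacement with the dot alone. Since the dot matches a single character, only the first letter after the dash is removed.

```python
import pandas as pd

people = pd.Series(['John - male', 'Nancy - female', 'Bruno - male'])

# ' - .' matches the separator plus exactly one character
one_char = people.str.replace(pat=' - .', repl='', regex=True)

# ' - .+' matches the separator plus everything that follows
rest = people.str.replace(pat=' - .+', repl='', regex=True)

print(one_char.tolist())
print(rest.tolist())
```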

### Homework

Some symbols, like `+`, have a special role in regular expressions. You have to be careful when using them in patterns, because they can give unexpected results. The following example illustrates this:

In [19]:
surprise = pd.Series(['I learn C++ and Python. Exciting!', 
  'The formula |x - y| = 7', 'My price is $20'])

In [20]:
surprise.str.contains('$2')

0    False
1    False
2    False
dtype: bool

In [21]:
surprise.str.contains('$')

0    True
1    True
2    True
dtype: bool

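The solution is to write them with an **escape character** (`\`), preferably in a raw string (`r'...'`), so that Python itself does not try to interpret the backslash: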

In [22]:
surprise.str.contains(r'\$2')

0    False
1    False
2     True
dtype: bool
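Two alternatives to escaping by hand are sketched below, for the same `$2` example: switching off regular expressions with `regex=False`, or letting the function `re.escape` (from the Python standard library) add the backslashes.

```python
import re
import pandas as pd

surprise = pd.Series(['I learn C++ and Python. Exciting!',
                      'The formula |x - y| = 7', 'My price is $20'])

# Option 1: treat the pattern as a literal string
literal_match = surprise.str.contains('$2', regex=False)

# Option 2: escape the special characters automatically
escaped_match = surprise.str.contains(re.escape('$2'))

print(literal_match.tolist())
print(escaped_match.tolist())
```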

Find examples, involving the characters `?`, `|`, `+` and `.`, that produce unexpected results.